# **Unveiling Customer Personas: A K-Means Clustering Approach for Strategic Retail Marketing**

---

**Project Overview & Context**

This project explores the application of unsupervised machine learning, specifically K-Means clustering, to perform customer segmentation for a retail company. It was undertaken as a capstone project for the **MIT Data Science and Machine Learning Program**. While the business case and dataset may have been used in other instructional contexts, this analysis, including all preprocessing, exploratory data analysis (EDA), modeling, and interpretation, was independently developed.

**Business Context**

Understanding customer personality and behavior is pivotal for businesses aiming to enhance customer satisfaction and drive revenue growth. Segmentation—based on a customer's demographics, lifestyle, and purchasing behavior—allows companies to move beyond generic strategies. It enables the creation of tailored marketing campaigns, improves customer retention by identifying and nurturing high-value groups, and helps optimize product offerings and resource allocation.

A leading retail company with a rapidly growing customer base seeks deeper insight into its customers' profiles. The company recognizes that understanding customer personalities, lifestyles, and purchasing habits can unlock significant opportunities for personalizing marketing strategies and creating loyalty programs. These insights can help address critical business challenges, such as improving the effectiveness of marketing campaigns, identifying high-value customer groups, and fostering long-term relationships with customers.

With competition intensifying in the retail space, a shift from generalized approaches to more targeted and personalized engagement is essential for sustaining a competitive edge.

---

**Project Objective**

To improve marketing efficiency and the overall customer experience, this project identifies distinct customer segments within the provided dataset. By characterizing the traits, preferences, and behaviors of each group, it aims to give the company a data-driven foundation to:

1.  Develop personalized marketing campaigns designed to increase conversion rates.
2.  Formulate effective retention strategies specifically for high-value customers.
3.  Optimize resource allocation, potentially informing decisions related to inventory management, pricing strategies, and even store layouts.

As the data scientist for this project, my responsibility is to analyze the customer data, apply K-Means clustering to segment the customer base, and provide actionable insights and clearly defined personas for each segment.

---

#### **Data Dictionary**  
The dataset includes historical data on customer demographics, personality traits, and purchasing behaviors. Key attributes are:  

1. **Customer Information**  
   - **ID:** Unique identifier for each customer.  
   - **Year_Birth:** Customer's year of birth.  
   - **Education:** Education level of the customer.  
   - **Marital_Status:** Marital status of the customer.  
   - **Income:** Yearly household income (in dollars).  
   - **Kidhome:** Number of children in the household.  
   - **Teenhome:** Number of teenagers in the household.  
   - **Dt_Customer:** Date when the customer enrolled with the company.  
   - **Recency:** Number of days since the customer’s last purchase.  
   - **Complain:** Whether the customer complained in the last 2 years (1 for yes, 0 for no).  

2. **Spending Information (Last 2 Years)**  
   - **MntWines:** Amount spent on wine.  
   - **MntFruits:** Amount spent on fruits.  
   - **MntMeatProducts:** Amount spent on meat.  
   - **MntFishProducts:** Amount spent on fish.  
   - **MntSweetProducts:** Amount spent on sweets.  
   - **MntGoldProds:** Amount spent on gold products.  

3. **Purchase and Campaign Interaction**  
   - **NumDealsPurchases:** Number of purchases made using a discount.  
   - **AcceptedCmp1:** Response to the 1st campaign (1 for yes, 0 for no).  
   - **AcceptedCmp2:** Response to the 2nd campaign (1 for yes, 0 for no).  
   - **AcceptedCmp3:** Response to the 3rd campaign (1 for yes, 0 for no).  
   - **AcceptedCmp4:** Response to the 4th campaign (1 for yes, 0 for no).  
   - **AcceptedCmp5:** Response to the 5th campaign (1 for yes, 0 for no).  
   - **Response:** Response to the last campaign (1 for yes, 0 for no).  

4. **Shopping Behavior**  
   - **NumWebPurchases:** Number of purchases made through the company’s website.  
   - **NumCatalogPurchases:** Number of purchases made using catalogs.  
   - **NumStorePurchases:** Number of purchases made directly in stores.  
   - **NumWebVisitsMonth:** Number of visits to the company’s website in the last month.  

#### **Importing necessary libraries**

In [1]:
# --- Core Libraries for Data Handling and Numerical Operations ---
import pandas as pd
import numpy as np

# --- Libraries for Visualization ---
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True) # Enables Plotly rendering in Jupyter; connected=True loads plotly.js from a CDN (needs internet access)
from scipy.stats import gaussian_kde # For KDE calculation


# --- Libraries for Data Preprocessing ---
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder # OneHotEncoder might be needed later
from sklearn.impute import SimpleImputer

# --- Libraries for Clustering ---
from sklearn.cluster import KMeans

# --- Libraries for Dimensionality Reduction ---
from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE # Uncomment if you plan to use t-SNE

# --- Libraries for Evaluating Cluster Performance ---
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# --- Potentially useful for advanced visualization/diagnostics ---
# import missingno as msno
# from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# --- Utility Libraries ---
import warnings
from datetime import datetime # For potential date calculations

# --- Setting up an aesthetic style for plots ---
# Seaborn style - using "white" for no grid, "ticks" could also be an option
sns.set_style("white")
plt.rcParams['axes.grid'] = False # Disable grid globally for matplotlib
# To remove top and right spines in seaborn plots for a cleaner look: sns.despine() # Call this after creating a plot

# Plotly template
import plotly.io as pio
pio.templates.default = "plotly_white" # Using a clean template for Plotly

# Ignore warnings for a cleaner notebook output
warnings.filterwarnings('ignore')

%config InlineBackend.figure_format = 'retina' # For high-resolution plots in Jupyter Notebook

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Seaborn version: {sns.__version__}")
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")
import plotly
print(f"Plotly version: {plotly.__version__}")

Libraries imported successfully!
Pandas version: 2.2.2
NumPy version: 1.26.4
Seaborn version: 0.13.2
Scikit-learn version: 1.4.2
Plotly version: 5.22.0


#### **Loading the data**

In [None]:
# --- Define the file path ---
file_path = "03_marketing_campaign_data.csv"

# --- Load the dataset using pandas and create a working copy ---
try:
    # Read the tab-delimited CSV file into a pandas DataFrame
    df_raw = pd.read_csv(file_path, sep='\t')
    print(f"Successfully loaded data from: {file_path} into 'df_raw'.")

    # Create a working copy of the DataFrame to avoid modifying the original
    df = df_raw.copy()
    print("Created a working copy of the DataFrame named 'df'.")

except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
    print("Please ensure the file is in the correct directory or provide the full path.")
    df_raw = None # Assign None to prevent errors if loading fails
    df = None     # Assign None to prevent errors if loading fails

except Exception as e:
    print(f"An error occurred while loading the data: {e}")
    df_raw = None # Assign None in case of other errors
    df = None     # Assign None in case of other errors

Successfully loaded data from: marketing_campaign.csv into 'df_raw'.
Created a working copy of the DataFrame named 'df'.


## **Phase 1: Data Overview**

### 1.1. Initial Data Structure and Type Inspection
To understand the structure of our dataset, we'll use the `df.info()` method. This will provide a summary including the index dtype and columns, non-null values, and memory usage. It's particularly useful for a first check on data types and identifying columns with missing data.

In [3]:
import io

if 'df' in locals() and df is not None:
    print("--- DataFrame Information (df.info()) ---")
    # Using io.StringIO to capture the output of df.info() as a string for controlled printing.
    # This ensures the full output is displayed clearly, especially in environments
    # that might otherwise truncate long outputs or not show all columns for wide dataframes.
    buffer = io.StringIO()
    df.info(buf=buffer) # Pass the buffer to df.info()
    info_output = buffer.getvalue()
    print(info_output)

    # Additionally, it's often useful to explicitly see the number of missing values per column.
    print("\n--- Missing Values Per Column ---")
    missing_counts = df.isnull().sum()
    # Filter to show only columns that actually have missing values
    missing_counts = missing_counts[missing_counts > 0]

    if not missing_counts.empty:
        # Sorting helps to quickly identify columns with the most missing data
        print("Columns with missing values (sorted by count):")
        print(missing_counts.sort_values(ascending=False))
    else:
        print("No missing values found in any column.")

else:
    print("Error: DataFrame 'df' not found or was not loaded successfully in the previous step.")
    print("Please ensure the data loading cell was executed correctly and the DataFrame 'df' is available.")

--- DataFrame Information (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 

### 1.1. Initial Data Structure and Type Inspection (`df.info()`)

The `df.info()` method provides a concise summary of the DataFrame:

* **Shape:** The dataset contains **2240 entries (rows)** and **29 columns (features)**.
* **Data Types and Non-Null Counts:**
    * **Numerical Features (`int64`, `float64`):**
        * `ID`, `Year_Birth`, `Kidhome`, `Teenhome`, `Recency`, `MntWines`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`, `NumDealsPurchases`, `NumWebPurchases`, `NumCatalogPurchases`, `NumStorePurchases`, `NumWebVisitsMonth` are `int64` and have no missing values (2240 non-null entries each). This also applies to the campaign acceptance flags (`AcceptedCmp1-5`, `Response`) and `Complain` column, which are expected to be binary `int64`.
        * `Income` is a `float64` column with **2216 non-null entries**, indicating **24 missing values**.
    * **Categorical Features (`object`):**
        * `Education` and `Marital_Status` are `object` type and have no missing values. These will likely require encoding for machine learning.
        * `Dt_Customer` is an `object` type (representing dates) with no missing values. This will need conversion to a `datetime` format for time-based analysis.
* **Memory Usage:** The output also includes an estimate of the memory usage by the DataFrame.

**Key Observations & Immediate Action Items:**

1.  **Missing Data:** The `Income` column has 24 missing values that need to be imputed or handled.
2.  **Data Type Conversion:** The `Dt_Customer` column should be converted from `object` to `datetime` to enable time-series feature engineering (e.g., customer tenure).
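3.  **Categorical Encoding:** `Education` and `Marital_Status` will require transformation into a numerical format before being used in K-Means clustering (e.g., one-hot encoding for nominal categories such as `Marital_Status`, or ordinal encoding where the levels have a natural order, as `Education` may).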
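
As a rough illustration of the encoding step in item 3, the sketch below one-hot encodes two categorical columns with `pd.get_dummies`. The toy DataFrame and its values are made up for the example; this is not necessarily the encoding the project ends up using. One-hot encoding is shown because K-Means relies on Euclidean distance, and arbitrary integer labels would imply an ordering between categories that may not exist.

```python
import pandas as pd

# Toy stand-in for the two categorical columns (values are illustrative only).
toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Master", "Graduation"],
    "Marital_Status": ["Single", "Married", "Together", "Married"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
})

# One-hot encode the nominal columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Education", "Marital_Status"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so the number of features grows with the number of distinct levels. That is worth keeping in mind for `Marital_Status`, which has 8 unique values.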

### 1.2. Initial Data Preview & Descriptive Statistics

After understanding the data types and checking for missing values with `df.info()`, the next step is to get a feel for the actual data content. We'll use:
* `df.head()`: To view the first few rows of the DataFrame. This helps us see examples of data in each column, including the format of `Dt_Customer`.
* `df.tail()`: To view the last few rows, which can sometimes reveal issues with data collection or trailing entries.
* `df.describe(include='all')`: To generate descriptive statistics for all columns.
    * For **numerical columns**, it will show count, mean, standard deviation, min, max, and quartiles.
    * For **object/categorical columns** (like `Education`, `Marital_Status`, and `Dt_Customer` in its current state), it will show count, number of unique values, the most frequent value (top), and its frequency (freq).

This overview helps in understanding data distributions, identifying potential outliers or unusual values, and planning the data cleaning and feature engineering steps that follow.

In [4]:
if 'df' in locals() and df is not None:
    print("--- First 5 Rows (df.head()) ---")
    display(df.head()) # Using display for better rendering of DataFrames in Jupyter

    print("\n--- Last 5 Rows (df.tail()) ---")
    display(df.tail())

    print("\n--- Descriptive Statistics for All Columns (df.describe(include='all')) ---")
    # For wide DataFrames, transposing can sometimes make it easier to read
    # display(df.describe(include='all').T)
    display(df.describe(include='all'))

else:
    print("Error: DataFrame 'df' not found or was not loaded successfully in the previous step.")
    print("Please ensure the data loading cell was executed correctly and the DataFrame 'df' is available.")

--- First 5 Rows (df.head()) ---


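|  | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |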



--- Last 5 Rows (df.tail()) ---


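|  | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |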



--- Descriptive Statistics for All Columns (df.describe(include='all')) ---


|  | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2240.0 | 2240.0 | 2240 | 2240 | 2216.0 | 2240.0 | 2240.0 | 2240 | 2240.0 | 2240.0 | ... | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 | 2240.0 |
| unique |  |  | 5 | 8 |  |  |  | 663 |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| top |  |  | Graduation | Married |  |  |  | 31-08-2012 |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| freq |  |  | 1127 | 864 |  |  |  | 12 |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| mean | 5592.159821 | 1968.805804 |  |  | 52247.251354 | 0.444196 | 0.50625 |  | 49.109375 | 303.935714 | ... | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.013393 | 0.009375 | 3.0 | 11.0 | 0.149107 |
| std | 3246.662198 | 11.984069 |  |  | 25173.076661 | 0.538398 | 0.544538 |  | 28.962453 | 336.597393 | ... | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.114976 | 0.096391 | 0.0 | 0.0 | 0.356274 |
| min | 0.0 | 1893.0 |  |  | 1730.0 | 0.0 | 0.0 |  | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| 25% | 2828.25 | 1959.0 |  |  | 35303.0 | 0.0 | 0.0 |  | 24.0 | 23.75 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| 50% | 5458.5 | 1970.0 |  |  | 51381.5 | 0.0 | 0.0 |  | 49.0 | 173.5 | ... | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |
| 75% | 8427.75 | 1977.0 |  |  | 68522.0 | 1.0 | 1.0 |  | 74.0 | 504.25 | ... | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 0.0 |


### 1.2. Initial Data Preview & Descriptive Statistics

A first look at the data rows and summary statistics reveals several key characteristics:

* **Data Glimpse (`df.head()`, `df.tail()`):**
    * The `ID` column appears to be a unique identifier for customers.
    * `Year_Birth` shows years like 1957, 1954, etc., from which customer age can be derived.
    * `Education` includes values like "Graduation" and "PhD".
    * `Marital_Status` shows "Single," "Together," "Married," and "Divorced."
    * `Income` has floating-point values (e.g., 58138.0).
    * `Kidhome` and `Teenhome` contain small integers (0, 1, 2).
    * `Dt_Customer` (e.g., "04-09-2012") confirms its string format as **DD-MM-YYYY**, which is crucial for later conversion to datetime.
    * Spending (`Mnt...`), purchase (`Num...`), and campaign acceptance columns (`AcceptedCmp...`, `Response`) are numerical, mostly integer, with campaign flags appearing as 0s or 1s.
    * The `Complain` column also appears to be binary (0 or 1).
    * The columns `Z_CostContact` and `Z_Revenue` show constant values (3 and 11 respectively across all rows shown and in describe output). These might be fixed costs/values or placeholder data and warrant further investigation to determine their utility.

* **Descriptive Statistics (`df.describe(include='all')`):**
    * **General:** 2240 total entries.
    * **`Year_Birth`:** Ranges from 1893 to 1996. The minimum year (1893) seems unusually old and could be an outlier or data entry error. The mean birth year is around 1969.
    * **`Income`:** As noted from `df.info()`, there are 2216 non-null entries. The mean income is approximately $52,247, but there's a very wide range from $1,730 to $666,666. The high maximum value suggests potential outliers or a highly skewed distribution.
    * **`Education`:** There are 5 unique education levels, with "Graduation" being the most frequent (1127 occurrences).
    * **`Marital_Status`:** Shows 8 unique statuses. "Married" is the most frequent (864 occurrences). *(Note: The 8 unique values might indicate variations like "Alone", "Absurd", "YOLO" that are sometimes present in this dataset – worth checking unique values directly later).*
    * **`Dt_Customer`:** There are 663 unique enrollment dates, with "31-08-2012" being the most frequent (12 customers enrolled on this date).
    * **`Kidhome`/`Teenhome`:** Customers have between 0 and 2 kids and between 0 and 2 teens at home. At least half have no kids and at least half have no teens, since the median (50th percentile) is 0 for both; the means are 0.44 and 0.51 respectively.
    * **Spending Columns (`Mnt...`):** Show wide variations, with some customers spending 0 and others spending up to around $1493 (e.g., on wines).
    * **Campaign Acceptance & `Response`:** These are binary (min 0, max 1). The mean values (e.g., `Response` mean is 0.149) indicate the proportion of customers who accepted each campaign/responded to the last one. For example, about 14.9% responded to the last campaign.
    * **`Complain`:** Very few customers have complained in the last 2 years (mean 0.009, so less than 1%).
    * **`Z_CostContact` & `Z_Revenue`:** These columns have a standard deviation of 0 and a single unique value each (3 and 11 respectively), confirming they are constant across the dataset. Because they do not vary, they cannot help separate customers into segments.
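
The constant-column observation can be checked programmatically rather than by reading the summary table. The following is a minimal sketch on a made-up three-row DataFrame; the `toy` frame and the `constant_cols` and `reduced` names are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking one varying column and the two constant columns.
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# A column with exactly one distinct value carries no information for clustering.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)

# Dropping them returns a new frame and leaves the original untouched.
reduced = toy.drop(columns=constant_cols)
```

Using `nunique` instead of hard-coding column names means the same check would also catch any other column that turns out to be constant.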

**Key Immediate Observations & Next Steps (Reinforced):**

1.  **`Dt_Customer` Conversion:** Confirm format is DD-MM-YYYY and proceed with conversion to datetime.
2.  **`Income` Imputation:** Address the 24 missing values. The wide range and potential outliers should be considered during imputation.
3.  **`Year_Birth` Outlier:** Investigate the customer(s) with `Year_Birth` of 1893.
4.  **Categorical Variables:** Explore the unique values in `Education` and `Marital_Status` more closely to understand all categories before encoding.
5.  **Constant Columns:** `Z_CostContact` and `Z_Revenue` can likely be dropped as they provide no variance.
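
For the outlier checks in items 2 and 3, one common heuristic is Tukey's interquartile-range (IQR) fences. The sketch below applies it to a handful of made-up birth years. The helper name `iqr_bounds` and the multiplier `k=1.5` are illustrative assumptions, and the rule is only a screening aid, not a definitive test for data entry errors.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy birth years; 1893 mimics the suspicious minimum seen in the summary.
year_birth = pd.Series([1957, 1954, 1965, 1984, 1981, 1970, 1893])

lower, upper = iqr_bounds(year_birth)
outliers = year_birth[(year_birth < lower) | (year_birth > upper)]
print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())
```

Values flagged this way still need a manual look: a very high income may be genuine, whereas a birth year of 1893 is almost certainly not.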

### 1.3. Feature Engineering & Data Type Optimization

To enhance our dataset for analysis and prepare it for modeling, we're undertaking several key feature engineering and data type optimization steps:

* **Convert `Dt_Customer` to Datetime:** The `Dt_Customer` column, representing the customer's enrollment date, was an `object` (string). We've converted it to a `datetime` object. This enables time-based calculations (like customer tenure, if needed) and ensures correct handling of date information. The observed format was "DD-MM-YYYY".
* **Create `Age` Column:** We've used the `Year_Birth` column to calculate the customer's `Age` as of the current year (2025). Age is often a more intuitive feature for segmentation than birth year. The outlier identified in `Year_Birth` (1893) has resulted in an unusually high age value for at least one record, which will require further review.
* **Optimize Categorical String Columns (Upcoming):** Columns like `Education` and `Marital_Status` are currently `object` types containing string values with a limited set of unique entries. In the next step, we plan to convert these to pandas' `category` data type.
    * **Why convert to `category`?**
        * **Memory Efficiency:** This can significantly reduce memory usage as pandas stores the unique values once and uses integer codes internally.
        * **Performance:** Certain operations on categorical data can be faster.
        * **Semantic Clarity:** It explicitly flags these columns as categorical, which can be beneficial for some libraries and for overall understanding of the data structure.
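
The memory argument above can be verified directly with `memory_usage(deep=True)`. This is a small sketch on a synthetic column of 2,000 repeated labels. The labels and the series length are assumptions chosen for the example, so the absolute byte counts will differ from those of the real dataset.

```python
import pandas as pd

# Toy column with few distinct labels repeated many times (illustrative values).
education = pd.Series(
    ["Graduation", "PhD", "Master", "Basic", "Other"] * 400, dtype=object
)

# deep=True counts the Python string objects, not just the pointers to them.
bytes_as_object = education.memory_usage(deep=True)
bytes_as_category = education.astype("category").memory_usage(deep=True)

print(f"object:   {bytes_as_object:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```

The saving grows with the ratio of rows to distinct values; for a column where nearly every value is unique, `category` offers little or no benefit.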

In [5]:
from datetime import datetime

if 'df' in locals() and df is not None:
    # --- Convert Dt_Customer to datetime ---
    try:
        print("--- Converting Dt_Customer to datetime ---")
        df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
        print("Successfully converted 'Dt_Customer'. New Dtype:", df['Dt_Customer'].dtype)
    except Exception as e:
        print(f"Error converting Dt_Customer: {e}")

    # --- Create Age column ---
    try:
        print("\n--- Creating Age column ---")
        current_year = datetime.now().year # Evaluated to 2025 when this notebook was run; hard-code a year if reproducible ages are needed
        df['Age'] = current_year - df['Year_Birth']
        print(f"Successfully created 'Age' column calculated as of {current_year}.")

        # Display a sample of the new 'Age' column and its basic statistics
        print("\nSample of 'Age' column (first 5 values):")
        display(df[['Year_Birth', 'Age']].head())
        print("\nDescriptive statistics for 'Age':")
        display(df['Age'].describe())

        # Acknowledge potential age outlier due to Year_Birth outlier
        if 1893 in df['Year_Birth'].values:
            max_age_calculated = df['Age'].max()
            print(f"\nNote: The maximum age calculated is {max_age_calculated}, likely due to the Year_Birth outlier (1893). This will be reviewed.")

    except Exception as e:
        print(f"Error creating Age column: {e}")

    # --- Verify changes with df.info() for relevant columns ---
    print("\n--- Verifying DataFrame info for Dt_Customer and Age ---")
    # To make df.info() output more concise here, we can pass a list of columns
    # However, for a full picture, you might want to see the full info or just the relevant dtypes.
    # For now, let's check dtypes directly for these two.
    if 'Dt_Customer' in df.columns:
        print(f"Dt_Customer Dtype: {df['Dt_Customer'].dtype}")
    if 'Age' in df.columns:
        print(f"Age Dtype: {df['Age'].dtype}")
        print(f"Number of non-null values in Age: {df['Age'].count()}")

else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure the data loading cell was executed correctly.")

--- Converting Dt_Customer to datetime ---
Successfully converted 'Dt_Customer'. New Dtype: datetime64[ns]

--- Creating Age column ---
Successfully created 'Age' column calculated as of 2025.

Sample of 'Age' column (first 5 values):


|  | Year_Birth | Age |
|---|---|---|
| 0 | 1957 | 68 |
| 1 | 1954 | 71 |
| 2 | 1965 | 60 |
| 3 | 1984 | 41 |
| 4 | 1981 | 44 |



Descriptive statistics for 'Age':


count    2240.000000
mean       56.194196
std        11.984069
min        29.000000
25%        48.000000
50%        55.000000
75%        66.000000
max       132.000000
Name: Age, dtype: float64


Note: The maximum age calculated is 132, likely due to the Year_Birth outlier (1893). This will be reviewed.

--- Verifying DataFrame info for Dt_Customer and Age ---
Dt_Customer Dtype: datetime64[ns]
Age Dtype: int64
Number of non-null values in Age: 2240
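
With `Dt_Customer` stored as a datetime, the customer tenure mentioned earlier becomes a simple subtraction. The sketch below is one possible way to derive it, using three dates taken from the preview rows. The `Tenure_Days` column name and the choice of the latest enrollment date as the reference point are assumptions, not steps taken in this notebook.

```python
import pandas as pd

# Toy enrollment dates in the DD-MM-YYYY format observed in the data.
toy = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"]})
toy["Dt_Customer"] = pd.to_datetime(toy["Dt_Customer"], format="%d-%m-%Y")

# Assumption: measure tenure relative to the latest enrollment date in the data,
# which keeps the feature reproducible regardless of when the notebook is run.
reference_date = toy["Dt_Customer"].max()
toy["Tenure_Days"] = (reference_date - toy["Dt_Customer"]).dt.days

print(toy)
```

Anchoring to a date inside the dataset avoids the drift that comes from using today's date, which changes the feature every time the notebook is rerun.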


In [6]:
print("--- Converting object columns to 'category' dtype ---")
categorical_cols_to_convert = ['Education', 'Marital_Status']
for col in categorical_cols_to_convert:
    if col in df.columns:
        print(f"Converting '{col}' from {df[col].dtype} to category...")
        df[col] = df[col].astype('category')
        print(f"'{col}' new dtype: {df[col].dtype}, Non-null: {df[col].count()}")
    else:
        print(f"Warning: Column '{col}' not found in DataFrame.")

# Verify the dtypes after conversion.
# Note: df.info() prints its summary directly and returns None, so it is not wrapped in print().
print("\n--- Verifying Dtypes after conversion ---")
if all(col in df.columns for col in categorical_cols_to_convert):
    df[categorical_cols_to_convert].info()

--- Converting object columns to 'category' dtype ---
Converting 'Education' from object to category...
'Education' new dtype: category, Non-null: 2240
Converting 'Marital_Status' from object to category...
'Marital_Status' new dtype: category, Non-null: 2240

--- Verifying Dtypes after conversion ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Education       2240 non-null   category
 1   Marital_Status  2240 non-null   category
dtypes: category(2)
memory usage: 5.1 KB


### 1.4. Handling Missing Values in `Income`

During our initial data overview (`df.info()`), we identified 24 missing values in the `Income` column. For this iteration of the analysis, we will address these missing values by removing the rows where `Income` is not specified.

**Considerations for Dropping Missing Values:**

* **Amount of Data Loss:** We have 2240 initial entries. Dropping 24 rows represents a loss of approximately 1.07% of the total data. While not a very large percentage, it's always important to consider if the lost data might introduce bias or significantly reduce the dataset's representativeness.
* **Importance of the Feature:** `Income` is likely a crucial feature for customer segmentation. If the missingness were not random, dropping these rows could skew the analysis.
* **Alternative (Imputation):** In other scenarios, especially if the percentage of missing data were higher or if every data point was critical, imputation techniques (e.g., mean, median, regression imputation, or more advanced methods like K-Nearest Neighbors imputation) would be considered to fill in the missing `Income` values.

For now, we'll proceed with dropping these rows to ensure all entries used for clustering have a complete `Income` profile. We will then verify the shape of the DataFrame and check for any remaining missing values.
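
For comparison, the imputation alternative mentioned above could look roughly like the following. This is a sketch on toy values, not the approach taken in this notebook: it fills missing `Income` with the median via scikit-learn's `SimpleImputer`, and the `Income_Imputed` column name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy incomes with one missing entry and one extreme value.
toy = pd.DataFrame({"Income": [58138.0, np.nan, 71613.0, 26646.0, 666666.0]})

# The median is less sensitive than the mean to extreme values such as 666666.
imputer = SimpleImputer(strategy="median")
toy["Income_Imputed"] = imputer.fit_transform(toy[["Income"]]).ravel()

print(toy)
```

Median imputation keeps all rows but shrinks the variance of `Income` slightly, so with only about 1% of values missing either choice is defensible.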

In [7]:
if 'df' in locals() and df is not None:
    print(f"Original DataFrame shape: {df.shape}")
    print(f"Original number of missing values in 'Income': {df['Income'].isnull().sum()}")

    # --- Drop rows where 'Income' is NaN ---
    # The .dropna() method is used here, specifically targeting the 'Income' column.
    # 'subset=['Income']' ensures that rows are dropped only if 'Income' is missing.
    # 'inplace=True' modifies the DataFrame directly.
    df.dropna(subset=['Income'], inplace=True)
    print("\n--- Dropped rows with missing 'Income' values ---")

    # --- Verify the changes ---
    print(f"New DataFrame shape: {df.shape}")
    print(f"Number of missing values in 'Income' after dropping: {df['Income'].isnull().sum()}")

    # Optional: Verify missing values across the entire DataFrame
    print("\nTotal missing values per column after operation:")
    missing_values_after = df.isnull().sum()
    # Filter to show only columns that still have missing values (if any)
    missing_values_after = missing_values_after[missing_values_after > 0]
    if not missing_values_after.empty:
        print(missing_values_after)
    else:
        print("No missing values found in any column after dropping NaNs from 'Income'.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure the data loading and previous processing cells were executed correctly.")

Original DataFrame shape: (2240, 30)
Original number of missing values in 'Income': 24

--- Dropped rows with missing 'Income' values ---
New DataFrame shape: (2216, 30)
Number of missing values in 'Income' after dropping: 0

Total missing values per column after operation:
No missing values found in any column after dropping NaNs from 'Income'.


### 1.5. Checking for and Handling Duplicate Rows

Before proceeding further, it's crucial to check for any duplicate entries in the dataset. Duplicate rows can skew analysis results, lead to incorrect model training, and generally indicate data quality issues.

* **Identification:** We will count the number of completely duplicate rows.
* **Action:** If duplicates are found, we will remove them, keeping the first occurrence by default. This ensures that each customer record (assuming `ID` is a unique customer identifier, or the entire row represents a unique observation) is represented only once.
* **Verification:** We will report the number of duplicates found and the shape of the DataFrame before and after removal.
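
One caveat with a full-row check: if `ID` is unique per record, two rows that are identical in every other attribute will never be flagged. The sketch below illustrates the difference on made-up rows. Whether such attribute-level matches are true duplicates or simply similar customers is a judgment call, and this check is an optional extra rather than part of the procedure described above.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Year_Birth": [1957, 1957, 1984],
    "Income": [58138.0, 58138.0, 26646.0],
})

# Full-row check: unique IDs make every row look distinct.
full_row_dupes = toy.duplicated().sum()

# Check on every column except ID: rows 0 and 1 share the same attributes.
feature_cols = toy.columns.drop("ID").tolist()
attribute_dupes = toy.duplicated(subset=feature_cols).sum()

print(full_row_dupes, attribute_dupes)
```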

In [8]:
if 'df' in locals() and df is not None:
    print(f"Original DataFrame shape: {df.shape}")

    # --- Check for duplicate rows ---
    # df.duplicated() returns a boolean Series indicating whether each row is a duplicate of a previous row.
    # .sum() then counts the number of True values (i.e., number of duplicate rows).
    num_duplicates = df.duplicated().sum()
    print(f"Number of duplicate rows found: {num_duplicates}")

    if num_duplicates > 0:
        # keep=False marks every row that belongs to a duplicate group,
        # so all occurrences (including the first) are shown side by side.
        print("Displaying all rows involved in duplication:")
        display(df[df.duplicated(keep=False)].sort_values(by=list(df.columns)))


        # --- Drop duplicate rows ---
        # keep='first' is the default: the first occurrence is kept and later duplicates are removed.
        # 'inplace=True' modifies the DataFrame directly.
        df.drop_duplicates(inplace=True)
        print("\n--- Duplicate rows (if any) have been removed ---")
        print(f"New DataFrame shape after dropping duplicates: {df.shape}")
        print(f"Number of duplicate rows remaining: {df.duplicated().sum()}") # Should be 0
    else:
        print("No duplicate rows found in the DataFrame.")

else:  # 'df' is not defined or is None
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure the data loading and previous processing cells were executed correctly.")

Original DataFrame shape: (2216, 30)
Number of duplicate rows found: 0
No duplicate rows found in the DataFrame.


### 1.6. Handling the `Age` Outlier

When we created the `Age` column in the previous step, we noted an individual with a calculated age of 132, stemming from a `Year_Birth` of 1893. This is implausible for a customer in this dataset, so we treat it as a data entry error.

* **Identification:** We will look for records where `Age` is at or above a reasonable upper threshold of 100. This catches the 132-year-old record along with any other implausible ages.
* **Quantification:** We'll count how many such records exist.
* **Action:** If the number of records with this extreme age is very small (e.g., 1-3 records), we will remove these rows. Removing a tiny fraction of data due to clear errors is generally acceptable and improves data quality for subsequent analysis and modeling. If there were many such records, a different strategy (like capping or more in-depth investigation) might be warranted, but that's not expected here.
* **Verification:** We will report the number of records removed and then check the new shape of the DataFrame and the descriptive statistics for the `Age` column to ensure the outlier has been handled.
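
The decision rule in the list above can be sketched on toy data as follows. The ages, the threshold of 100, and the limit of 3 rows are assumptions chosen for illustration; the sketch filters with a boolean mask, which is equivalent to dropping the matching index labels.

```python
import pandas as pd

# Toy ages; 125 and 132 stand in for implausible values from bad birth years.
toy = pd.DataFrame({"Age": [29, 48, 55, 66, 85, 125, 132]})

AGE_THRESHOLD = 100   # assumed cut-off for an implausible customer age
MAX_ROWS_TO_DROP = 3  # assumed limit for dropping without further review

is_outlier = toy["Age"] >= AGE_THRESHOLD
n_outliers = int(is_outlier.sum())

if 0 < n_outliers <= MAX_ROWS_TO_DROP:
    # Keep only the rows that do not match the outlier mask.
    cleaned = toy.loc[~is_outlier].reset_index(drop=True)
else:
    # Nothing to drop, or too many rows to drop without a closer look.
    cleaned = toy.copy()

print(n_outliers, cleaned["Age"].max())
```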

In [9]:
if 'df' in locals() and df is not None:
    print(f"Original DataFrame shape: {df.shape}")
    print("Original descriptive statistics for 'Age':")
    display(df['Age'].describe())

    # --- Identify records with extreme age (e.g., Age == 132 or Age > 100) ---
    extreme_age_threshold = 100  # Well below 132, so other implausible ages are caught too
    outlier_age_condition = df['Age'] >= extreme_age_threshold
    
    num_age_outliers = outlier_age_condition.sum()
    print(f"\nNumber of records with Age >= {extreme_age_threshold}: {num_age_outliers}")

    if num_age_outliers > 0:
        print("Displaying records with extreme age:")
        display(df[outlier_age_condition])

        # --- Decide whether to drop ---
        # For this specific case, if it's just 1 or a few, we'll drop.
        if 0 < num_age_outliers <= 3: # Arbitrary small number
            print(f"\nProceeding to drop {num_age_outliers} record(s) with extreme age.")
            # Get the indices of the rows to drop
            indices_to_drop = df[outlier_age_condition].index
            df.drop(indices_to_drop, inplace=True)
            
            print("\n--- Record(s) with extreme age have been removed ---")
            print(f"New DataFrame shape: {df.shape}")
            print("New descriptive statistics for 'Age':")
            display(df['Age'].describe())
        else:
            print(f"\nFound {num_age_outliers} records with extreme age. Reviewing before deciding to drop.")
            # If many outliers, you might reconsider the strategy.
    else:
        print(f"No records found with Age >= {extreme_age_threshold}.")

else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure the data loading and previous processing cells were executed correctly.")



Original DataFrame shape: (2216, 30)
Original descriptive statistics for 'Age':


count    2216.000000
mean       56.179603
std        11.985554
min        29.000000
25%        48.000000
50%        55.000000
75%        66.000000
max       132.000000
Name: Age, dtype: float64


Number of records with Age >= 100: 3
Displaying records with extreme age:


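| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192 | 7829 | 1900 | 2n Cycle | Divorced | 36640.0 | 1 | 0 | 2013-09-26 | 99 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 11 | 0 | 125 |
| 239 | 11004 | 1893 | 2n Cycle | Single | 60182.0 | 0 | 1 | 2014-05-17 | 23 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 132 |
| 339 | 1150 | 1899 | PhD | Together | 83532.0 | 0 | 0 | 2013-09-26 | 36 | 755 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 11 | 0 | 126 |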



Proceeding to drop 3 record(s) with extreme age.

--- Record(s) with extreme age have been removed ---
New DataFrame shape: (2213, 30)
New descriptive statistics for 'Age':


count    2213.000000
mean       56.082693
std        11.700216
min        29.000000
25%        48.000000
50%        55.000000
75%        66.000000
max        85.000000
Name: Age, dtype: float64

### 1.7. Dropping Irrelevant, Redundant, and Unsuitable Columns

As part of our data preparation for K-Means clustering, we now remove columns that are irrelevant, redundant, constant, or not usable by the algorithm without feature engineering that we are deferring for this iteration. The goal is to retain only those features that help differentiate customer groups by their characteristics and behaviors.

The following columns will be dropped:

* **`ID`**: This is a unique identifier for each customer. While essential for record-keeping, it provides no information about customer characteristics and would act as noise in a clustering algorithm.
* **`Year_Birth`**: We have already engineered an `Age` column from `Year_Birth`. Keeping both would introduce redundancy. `Age` is generally more interpretable and directly usable for segmentation.
* **`Z_CostContact`**: This column was identified as having a constant value for all customers. Features with no variance do not help in distinguishing between groups and are thus uninformative for clustering.
* **`Z_Revenue`**: Similar to `Z_CostContact`, this column also has a constant value across all records and will be dropped for the same reason.
* **`Dt_Customer`**: This column represents the customer's enrollment date. While it was converted to a datetime object, using it directly in K-Means is not possible. Engineering a tenure feature was considered, but due to the age of the data relative to the present, we've opted to simplify and exclude time-based tenure for this analysis iteration. Thus, the original datetime column will be dropped.

After dropping these columns, we will verify the new shape of the DataFrame and list the remaining columns to ensure our dataset is becoming more focused for the subsequent clustering task.
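
The constant-value check behind the `Z_CostContact` and `Z_Revenue` decision can be sketched as below. This is a minimal illustration on a made-up DataFrame, not the exact code used earlier in the notebook; it flags any column with a single distinct value and drops it together with the identifier.

```python
import pandas as pd

# Toy frame mirroring the situation: two columns never change.
toy = pd.DataFrame({
    "ID": [10, 11, 12],
    "Income": [40000.0, 52000.0, 61000.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# A column with one distinct value has zero variance and cannot separate clusters.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# errors="ignore" keeps the drop safe to re-run if a column is already gone.
reduced = toy.drop(columns=constant_cols + ["ID"], errors="ignore")
print(constant_cols, reduced.columns.tolist())
```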

In [10]:
if 'df' in locals() and df is not None:
    print(f"Original DataFrame shape (before this column dropping step): {df.shape}")
    print(f"Original columns (before this column dropping step): {df.columns.tolist()}")

    # Updated list of columns to drop
    columns_to_drop = ['ID', 'Year_Birth', 'Z_CostContact', 'Z_Revenue', 'Dt_Customer']
    
    # Check which of the columns to drop are actually present in the DataFrame.
    # This prevents errors if the cell is run multiple times or if some columns were already dropped
    # in an earlier (now removed/modified) version of the notebook.
    existing_columns_to_drop = [col for col in columns_to_drop if col in df.columns]

    if existing_columns_to_drop:
        print(f"\nColumns to be dropped in this step: {existing_columns_to_drop}")
        
        # --- Drop the specified columns ---
        # The 'columns=' keyword already targets columns, so 'axis' is not needed.
        # 'inplace=True' modifies the DataFrame directly.
        df.drop(columns=existing_columns_to_drop, inplace=True)
        
        print("\n--- Specified columns have been dropped ---")
        print(f"New DataFrame shape: {df.shape}")
        print(f"Remaining columns: {df.columns.tolist()}")
    else:
        print("\nNo columns from the specified list were found to drop in the current DataFrame.")
        print(f"Current DataFrame shape: {df.shape}")
        print(f"Current columns: {df.columns.tolist()}")

else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading cell (e.g., pd.read_csv()) was executed correctly before this step.")

Original DataFrame shape (before this column dropping step): (2213, 30)
Original columns (before this column dropping step): ['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response', 'Age']

Columns to be dropped in this step: ['ID', 'Year_Birth', 'Z_CostContact', 'Z_Revenue', 'Dt_Customer']

--- Specified columns have been dropped ---
New DataFrame shape: (2213, 25)
Remaining columns: ['Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumC

## **Phase 2: Exploratory Data Analysis**

Now that the data has been cleaned and preprocessed, we can move on to Exploratory Data Analysis (EDA). The goal of EDA is to understand the main characteristics of the data, uncover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. This understanding will be crucial for interpreting the results of our K-Means clustering later on.

We will structure our EDA as follows:
* **a. Univariate Analysis:** Examining individual features one by one to understand their distributions and characteristics.
* **b. Bivariate Analysis:** Exploring relationships between pairs of features.
* **c. Multivariate Analysis (Optional/As Needed):** Investigating interactions between three or more variables.

### **a. Univariate Analysis**

Univariate analysis focuses on describing and summarizing single variables in the dataset. For each variable, we want to understand its distribution, central tendency (mean, median), spread (variance, standard deviation, range), and identify any potential outliers or interesting patterns.

We'll look at:
* **Numerical Features:** Using histograms, box plots, and density plots.
* **Categorical Features:** Using bar charts and frequency tables.

### Defining a Reusable Function for Univariate Analysis

To efficiently analyze each variable in our dataset, we'll create a Python function. This function will take our DataFrame and a column name as input. Based on whether the column is numerical or categorical, it will:

* **For Numerical Columns:**
    * Print descriptive statistics (count, mean, std, min, quartiles, max).
    * Display an interactive histogram to visualize the distribution, potentially with a marginal box plot to also show spread and outliers.
* **For Categorical Columns:**
    * Print the frequency of each category (value counts).
    * Display an interactive bar chart to visualize these frequencies.

This approach promotes code reusability, consistency in our analysis, and makes the notebook easier to read and maintain. We'll use Plotly Express for creating the visualizations, ensuring they are interactive and adhere to our preference for a clean look (as our `plotly_white` template should be active).

In [11]:
def analyze_univariate(df, column_name, nbins_hist=30):
    """
    Performs customized univariate analysis for a specified column in a DataFrame.

    For numerical columns:
    - Prints descriptive statistics.
    - Displays a histogram (Navy Blue bars, y-axis as counts) with an overlaid 
      KDE line (Orange, scaled to counts).
    - Shows counts on top of histogram bars.
    - Includes a marginal box plot (Orange).

    For categorical/object columns:
    - Prints value counts.
    - Displays a bar chart (Navy Blue bars) with counts on top of bars.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column_name (str): The name of the column to analyze.
        nbins_hist (int): Number of bins for the histogram (if numerical).
    """
    if column_name not in df.columns:
        print(f"Error: Column '{column_name}' not found in DataFrame.")
        return

    print(f"--- Univariate Analysis for: {column_name} ---")
    # Use .copy() to avoid SettingWithCopyWarning if df is a slice
    column_data_full = df[[column_name]].copy() 
    column_data_clean = column_data_full[column_name].dropna() # For KDE and stats

    navy_blue = '#0B0056'
    orange_color = '#F86302'

    if pd.api.types.is_numeric_dtype(column_data_clean) and column_data_clean.nunique() > 1:
        print("\nDescriptive Statistics:")
        display(column_data_clean.describe().to_frame().T)

        print("\nDistribution Plot (Histogram with Scaled KDE & Marginal Box Plot):")
        
        # Create histogram with counts on y-axis
        fig = px.histogram(column_data_full, # Use full data for histogram to include NaNs if any were in original column
                           x=column_name, 
                           marginal="box", 
                           nbins=nbins_hist,
                           title=f"Distribution of {column_name} (Counts)",
                           color_discrete_sequence=[navy_blue],
                           text_auto=True # Shows actual counts on bars
                          )
        fig.update_layout(yaxis_title="Count") # Explicitly set y-axis title to Count

        # Style histogram bars
        fig.update_traces(marker_line_width=1.5, 
                          marker_line_color='black', # Add a border to bars for definition
                          selector=dict(type='histogram'))
        # Update marginal box plot color
        for trace in fig.data:
            if trace.type == 'box':
                trace.marker.color = orange_color
        
        # Add Scaled KDE Plot
        if len(column_data_clean) > 1:
            try:
                # Calculate KDE
                kde = gaussian_kde(column_data_clean)
                x_range = np.linspace(column_data_clean.min(), column_data_clean.max(), 500)
                kde_values_density = kde(x_range) # These are density values

                # To scale KDE to match histogram counts:
                # 1. Get bin width from the histogram (can be tricky if auto-binned)
                #    Alternatively, approximate or scale by N * (bin_size_approximation)
                #    A simpler scaling for visual overlay is to scale by total count and an estimate of bin width.
                #    Let's find the bin width from the generated histogram if possible.
                
                # Get histogram data to find bin width
                hist_data = go.Figure(fig).data[0] # Get the histogram trace
                if hasattr(hist_data, 'xbins') and hist_data.xbins and hist_data.xbins.size is not None:
                    bin_width = hist_data.xbins.size
                else: # Fallback if xbins.size is not available (e.g. for very few data points or specific Plotly versions)
                    # Approximate bin width (less accurate)
                    data_range = column_data_clean.max() - column_data_clean.min()
                    bin_width = data_range / nbins_hist if nbins_hist > 0 and data_range > 0 else 1


                # Scale KDE values: density * N * bin_width
                # N is the number of data points used for KDE
                kde_values_scaled = kde_values_density * len(column_data_clean) * bin_width
                
                fig.add_trace(go.Scatter(x=x_range, y=kde_values_scaled, 
                                         mode='lines', 
                                         name='KDE (Scaled)', 
                                         line=dict(color=orange_color, width=2), # KDE line color changed to orange
                                         yaxis='y1' # Ensure it plots on the primary y-axis
                                        ))
            except Exception as e:
                print(f"Could not generate KDE plot: {e}")
                print("Ensure 'scipy' is installed and 'from scipy.stats import gaussian_kde' is at the top of your script.")


        fig.update_layout(bargap=0.1)
        fig.show()

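    # isinstance(..., pd.CategoricalDtype) replaces is_categorical_dtype, which is deprecated in recent pandas.
    elif isinstance(column_data_clean.dtype, pd.CategoricalDtype) or pd.api.types.is_object_dtype(column_data_clean):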
        print("\nValue Counts (Frequency of Categories):")
        # Use original df[column_name] to include NaNs in value_counts if desired
        value_counts_df = df[column_name].value_counts(dropna=False).to_frame().reset_index()
        value_counts_df.columns = [column_name, 'Count']
        display(value_counts_df)

        print("\nFrequency Plot (Bar Chart):")
        fig_cat = px.bar(value_counts_df, 
                         x=column_name, 
                         y='Count',
                         title=f"Frequency of Categories in {column_name}",
                         color_discrete_sequence=[navy_blue],
                         text_auto=True
                        )
        
        fig_cat.update_traces(marker=dict(line=dict(color='black', width=0.5)), # Add a subtle border
                              selector=dict(type='bar'))
        fig_cat.update_layout(xaxis_title=column_name, yaxis_title="Count")
        fig_cat.show()
    else:
        print(f"\nColumn '{column_name}' has an unsupported data type ({column_data_clean.dtype}), no data after dropping NaNs, or too few unique values for this analysis function.")
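
The KDE scaling used in `analyze_univariate` (density × N × bin width) rests on a simple identity: a density integrates to 1, while a count histogram with equal-width bins "integrates" to N × bin width. The sketch below checks that identity with NumPy on synthetic data, using a density-normalized histogram as a stand-in for the KDE. The data and bin edges are assumptions for illustration. Note that Plotly treats `nbins` as a maximum and picks its own bin size, so the fallback bin width in the function is only an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=56, scale=12, size=1000)  # synthetic "ages"

# Fixed, equal-width bins so the bin width is known exactly.
edges = np.linspace(data.min(), data.max(), 26)  # 25 bins
bin_width = edges[1] - edges[0]

counts, _ = np.histogram(data, bins=edges)
density, _ = np.histogram(data, bins=edges, density=True)

# density integrates to 1, so density * N * bin_width recovers the counts.
rescaled = density * len(data) * bin_width
print(np.allclose(rescaled, counts))
```

The same factor applied to a smooth density estimate puts the curve on the histogram's count axis, which is why the overlay lines up with the bars when the bin width is right.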


#### 1. Age

In [12]:
if 'df' in locals() and df is not None:
    if 'Age' in df.columns:
        # Call the reusable function to analyze the 'Age' column
        # You can adjust nbins_hist if you want more or fewer bins in the histogram
        analyze_univariate(df, 'Age', nbins_hist=25) 
    else:
        print("Error: 'Age' column not found in the DataFrame.")
        print("Please ensure the 'Age' column was created correctly in previous steps.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: Age ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 2213.0 | 56.082693 | 11.700216 | 29.0 | 48.0 | 55.0 | 66.0 | 85.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 1. Age - Univariate Analysis Summary

The analysis of the `Age` column for the 2,213 customers (after cleaning) reveals the following:

* **Central Tendency & Spread:**
    * The average customer age is approximately **56 years**, with a median age of **55 years**. This indicates a fairly symmetrical distribution around the center.
    * The ages range from **29 to 85 years**.
    * The standard deviation is about **11.7 years**, showing a moderate spread in customer ages.
    * The middle 50% of customers (Interquartile Range - IQR) are aged between **48 and 66 years**.

* **Distribution Shape (Histogram & KDE):**
    * The histogram shows the highest concentration of customers in the **50-55 age group** (388 customers).
    * The KDE line confirms this peak, suggesting the most common ages are in the early 50s.
    * The distribution appears roughly **unimodal** (one main peak) and relatively symmetrical, with fewer customers at the younger (< 40) and older (> 75) extremes.
    * The box plot visually confirms the median and IQR, with no apparent outliers beyond the 29-85 range.

* **Key Insight:** The customer base is predominantly middle-aged, with a significant concentration in their late 40s to mid-60s.
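
The "no apparent outliers" reading can be cross-checked with a little arithmetic on the quartiles reported above. This sketch assumes the box plot uses the common 1.5 × IQR rule for its whiskers; any point outside the resulting fences would be drawn as an outlier.

```python
# Quartiles and extremes reported for Age in the descriptive statistics.
q1, q3 = 48.0, 66.0
age_min, age_max = 29.0, 85.0

iqr = q3 - q1                 # 18.0
lower_fence = q1 - 1.5 * iqr  # 21.0
upper_fence = q3 + 1.5 * iqr  # 93.0

# An outlier exists only if the minimum or maximum falls outside the fences.
has_outliers = age_min < lower_fence or age_max > upper_fence
print(lower_fence, upper_fence, has_outliers)  # 21.0 93.0 False
```

Since both 29 and 85 lie inside the 21 to 93 interval, no age is flagged under this rule.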

#### 2. Income

In [13]:
if 'df' in locals() and df is not None:
    if 'Income' in df.columns:
        # Call the reusable function to analyze the 'Income' column
        # Income can have a wide range, so more bins might be useful, or fewer if too sparse.
        # Let's start with a slightly higher number of bins.
        analyze_univariate(df, 'Income', nbins_hist=50) 
    else:
        print("Error: 'Income' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: Income ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Income | 2213.0 | 52236.581563 | 25178.603047 | 1730.0 | 35246.0 | 51373.0 | 68487.0 | 666666.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 2. Income - Univariate Analysis Summary

The analysis of the `Income` column for 2,213 customers shows a wide variation and some distinct characteristics:

* **Central Tendency & Spread:**
    * The average yearly household income is approximately **$52,237**, while the median income is slightly lower at **$51,373**. The gap is small, so the bulk of incomes sits fairly evenly around the center; the right skew comes mainly from a few very high values.
    * There's a very wide range of incomes, from a minimum of **$1,730** to an extreme maximum of **$666,666**.
    * The standard deviation is substantial at **$25,179**, indicating significant income variability.
    * The middle 50% of customers (IQR) have incomes between **$35,246 and $68,487**.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line clearly show a **strong right-skewed distribution**. The majority of customers have incomes concentrated in the lower to middle ranges (primarily below $100,000).
    * The distribution peaks roughly around the $40,000 - $60,000 mark.
    * There's a long tail extending towards higher incomes, with very few customers in the higher brackets (e.g., >$150,000).

* **Outliers:**
    * The box plot prominently displays several **outliers on the higher end**, including the maximum value of $666,666, which is significantly detached from the bulk of the data.

* **Key Insights:**
    * The customer base has a wide range of income levels, but it's predominantly composed of individuals with incomes below $100,000.
    * The presence of high-income outliers suggests that a small segment of customers has exceptionally high earnings compared to the majority. This skewness and the outliers are important considerations for segmentation and any subsequent modeling (e.g., transformations like log transform might be considered for `Income` before clustering if the skew is problematic).
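
To illustrate the log transform mentioned above, here is a minimal sketch on synthetic incomes (made-up values with one extreme entry, not the real data). It uses `np.log1p`, an assumed choice that also handles zeros, and compares sample skewness before and after.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed incomes with one extreme value; not the real data.
income = pd.Series(
    [18000, 35000, 42000, 51000, 58000, 68000, 90000, 666666],
    dtype="float64",
)

# log1p(x) = log(1 + x) is defined at 0, which matters for spending columns.
income_log = np.log1p(income)

# Series.skew() returns the sample skewness; values near 0 mean symmetry.
skew_before = income.skew()
skew_after = income_log.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

The transform compresses the long right tail, so the extreme value pulls the distribution less. Whether that is desirable depends on whether very high earners should stand apart as their own segment.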

#### 3. Amount Spent on Wine (`MntWines`)

In [14]:
if 'df' in locals() and df is not None:
    if 'MntWines' in df.columns:
        # Call the reusable function to analyze the 'MntWines' column
        # Spending data can also be skewed, so a decent number of bins is good.
        analyze_univariate(df, 'MntWines', nbins_hist=40) 
    else:
        print("Error: 'MntWines' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: MntWines ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MntWines | 2213.0 | 305.153638 | 337.30549 | 0.0 | 24.0 | 175.0 | 505.0 | 1493.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 3. Amount Spent on Wine (`MntWines`) - Univariate Analysis Summary

The analysis of `MntWines`, representing the amount spent on wine in the last two years by 2,213 customers, indicates the following:

* **Central Tendency & Spread:**
    * The average amount spent on wine is approximately **$305.15**, while the median is significantly lower at **$175.00**. This large difference between mean and median strongly indicates a right-skewed distribution.
    * Spending ranges from **$0 to $1,493**.
    * The standard deviation is high at **$337.31**, reflecting a wide dispersion in wine spending.
    * The middle 50% of customers (IQR) spent between **$24 and $505** on wine.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line clearly depict a **highly right-skewed distribution**. A very large number of customers (the tallest bar, 756 customers) spend very little on wine (in the first bin, $0 to $49).
    * The frequency of customers drops off sharply as the amount spent on wine increases.
    * There's a long tail extending towards higher spending amounts, indicating a smaller group of customers who are high-volume wine purchasers.

* **Outliers:**
    * The box plot shows many data points beyond the upper whisker, indicating numerous **outliers on the higher end of wine spending**. These are customers who spend considerably more on wine than the typical customer.

* **Key Insights:**
    * A significant portion of the customer base consists of low or non-spenders on wine.
    * There's a smaller but distinct segment of customers who are substantial wine spenders.
    * This strong skewness and the presence of high-spending outliers are important characteristics. For clustering, this might mean that "high wine spenders" could form a distinct group. Similar to `Income`, a transformation (e.g., log transform) might be considered for `MntWines` before clustering if the algorithm is sensitive to this skew.

#### 4. Amount Spent on Meat Products (`MntMeatProducts`)

In [15]:
if 'df' in locals() and df is not None:
    if 'MntMeatProducts' in df.columns:
        # Call the reusable function to analyze the 'MntMeatProducts' column
        # Similar to other spending data, let's use a reasonable number of bins.
        analyze_univariate(df, 'MntMeatProducts', nbins_hist=40) 
    else:
        print("Error: 'MntMeatProducts' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: MntMeatProducts ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MntMeatProducts | 2213.0 | 166.962494 | 224.226178 | 0.0 | 16.0 | 68.0 | 232.0 | 1725.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 4. Amount Spent on Meat Products (`MntMeatProducts`) - Univariate Analysis Summary

The analysis of `MntMeatProducts`, representing customer spending on meat over the last two years for 2,213 individuals, reveals the following characteristics:

* **Central Tendency & Spread:**
    * The average amount spent on meat products is approximately **$166.96**.
    * The median spending is considerably lower at **$68.00**, indicating a strong right-skewed distribution.
    * Spending ranges from **$0 to $1,725**.
    * A high standard deviation of **$224.23** reflects significant variability in meat expenditure.
    * The middle 50% of customers (IQR) spent between **$16 and $232** on meat products.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line illustrate a **highly right-skewed distribution**. The largest single group of customers (985 in the first bin, roughly 45% of the total) spends very little on meat products ($0 to $49).
    * Customer frequency decreases sharply as spending on meat increases, forming a long tail towards higher expenditure values.

* **Outliers:**
    * The box plot indicates a large number of **outliers on the higher end of spending**. These are customers who spend substantially more on meat than the bulk of the customer base.

* **Key Insights:**
    * A large segment of the customer base either does not purchase meat or purchases it in very small quantities.
    * A smaller group of customers accounts for significantly higher spending on meat.
    * This pronounced skewness is a key characteristic. Similar to `MntWines` and `Income`, if this feature is used in distance-based clustering, a transformation (e.g., log transform) might be beneficial to reduce the influence of the extreme values and long tail.
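
To see why the long tail matters for distance-based clustering, the sketch below compares one-dimensional distances between three hypothetical customers whose meat spending equals the first quartile, third quartile, and maximum reported above. The use of `np.log1p` is an assumption for illustration, not a step the analysis has committed to.

```python
import numpy as np

# Meat spending for three hypothetical customers (Q1, Q3, and max above).
low, high, extreme = 16.0, 232.0, 1725.0

# One-dimensional Euclidean distances on the raw scale.
raw_low_high = abs(high - low)          # 216.0
raw_high_extreme = abs(extreme - high)  # 1493.0

# The same distances after a log1p transform.
log_low_high = abs(np.log1p(high) - np.log1p(low))
log_high_extreme = abs(np.log1p(extreme) - np.log1p(high))

# Ratio > 1 means the extreme customer sits farther from "high" than "high" does from "low".
raw_ratio = raw_high_extreme / raw_low_high
log_ratio = log_high_extreme / log_low_high
print(raw_ratio, log_ratio)
```

On the raw scale the extreme spender is several times farther away than the gap across the interquartile range, so it can dominate centroid placement. After the transform the two gaps are of comparable size.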

#### 5. Amount Spent on Fish Products (`MntFishProducts`)

In [16]:
if 'df' in locals() and df is not None:
    if 'MntFishProducts' in df.columns:
        # Call the reusable function to analyze the 'MntFishProducts' column
        analyze_univariate(df, 'MntFishProducts', nbins_hist=30) 
    else:
        print("Error: 'MntFishProducts' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: MntFishProducts ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MntFishProducts | 2213.0 | 37.635337 | 54.763278 | 0.0 | 3.0 | 12.0 | 50.0 | 259.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 5. Amount Spent on Fish Products (`MntFishProducts`) - Univariate Analysis Summary

The analysis of `MntFishProducts`, reflecting customer spending on fish over the last two years for 2,213 individuals, shows the following:

* **Central Tendency & Spread:**
    * The average amount spent on fish products is approximately **$37.64**.
    * The median spending is much lower at **$12.00**, strongly indicating a right-skewed distribution.
    * Spending ranges from **$0 to $259.00**.
    * The standard deviation is relatively high at **$54.76** compared to the mean, reflecting considerable variability.
    * The middle 50% of customers (IQR) spent between **$3 and $50** on fish products.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line clearly demonstrate a **highly right-skewed distribution**. The largest single group of customers (981 in the first bin, roughly 44% of the total) spends very little ($0 to $9) on fish products.
    * The number of customers drops off rapidly as spending increases, forming a long tail towards higher spending amounts.

* **Outliers:**
    * The box plot highlights numerous **outliers on the higher end of spending**, representing customers who spend significantly more on fish than the typical customer.

* **Key Insights:**
    * A large proportion of the customer base spends very little or nothing on fish products.
    * A smaller segment of customers accounts for higher expenditures on fish.
    * This pronounced right-skewness is consistent with other specialized food spending categories (like wine and meat) and is an important characteristic to note for segmentation. Again, a transformation (e.g., log transform) might be considered for this feature before clustering if the skew is deemed problematic for the algorithm.
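The log transform mentioned above can be sketched as follows. This is a minimal illustration on a small, invented sample (not values from the dataset), using `log1p` so that customers who spent $0 are handled cleanly. The helper `population_skewness` is a hypothetical name introduced here; in the notebook itself, the equivalent would be something like `np.log1p(df['MntFishProducts'])`.

```python
import math
import statistics


def population_skewness(values):
    """Fisher-Pearson skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical spending values, invented for illustration only.
spend = [0, 0, 1, 2, 3, 5, 8, 12, 20, 35, 50, 90, 150, 259]

# log1p(x) = log(1 + x), so a spend of 0 maps to 0 instead of -infinity.
log_spend = [math.log1p(v) for v in spend]

skew_raw = population_skewness(spend)
skew_log = population_skewness(log_spend)
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

On this toy sample the long right tail is compressed and the skewness moves much closer to zero, which is the effect one would hope for before feeding such a feature to a distance-based algorithm.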

#### 6. Amount Spent on Fruits (`MntFruits`)

In [17]:
if 'df' in locals() and df is not None:
    if 'MntFruits' in df.columns:
        # Call the reusable function to analyze the 'MntFruits' column
        analyze_univariate(df, 'MntFruits', nbins_hist=30) 
    else:
        print("Error: 'MntFruits' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: MntFruits ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MntFruits | 2213.0 | 26.323995 | 39.735932 | 0.0 | 2.0 | 8.0 | 33.0 | 199.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 6. Amount Spent on Fruits (`MntFruits`) - Univariate Analysis Summary

The analysis of `MntFruits`, which tracks customer spending on fruit products over the last two years for 2,213 individuals, reveals the following:

* **Central Tendency & Spread:**
    * The average amount spent on fruits is approximately **$26.32**.
    * The median spending is significantly lower at **$8.00**, highlighting a strong right-skewed distribution.
    * Spending on fruits ranges from **$0 to $199.00**.
    * The standard deviation is **$39.74**, which is larger than the mean, indicating substantial relative variability.
    * The middle 50% of customers (IQR) spent between **$2 and $33** on fruits.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line clearly show a **highly right-skewed distribution**. Just over half of all customers (1161 in the first bin, roughly 52%) spend very little ($0 to $9) on fruits.
    * The frequency of customers decreases very sharply as spending on fruits increases, resulting in a long tail towards higher spending values.

* **Outliers:**
    * The box plot indicates a large number of **outliers on the higher end of spending**. These are customers who purchase considerably more fruit than the typical customer.

* **Key Insights:**
    * A very large proportion of the customer base spends minimally on fruits.
    * A smaller group of customers accounts for higher expenditures in this category.
    * This pronounced right-skewness is a consistent theme across several spending variables and is an important characteristic. As with other skewed spending data, a transformation (e.g., log transform) might be considered for this feature before clustering if the algorithm is sensitive to such distributions.

#### 7. Amount Spent on Sweet Products (`MntSweetProducts`)

In [18]:
if 'df' in locals() and df is not None:
    if 'MntSweetProducts' in df.columns:
        # Call the reusable function to analyze the 'MntSweetProducts' column
        analyze_univariate(df, 'MntSweetProducts', nbins_hist=30) 
    else:
        print("Error: 'MntSweetProducts' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: MntSweetProducts ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MntSweetProducts | 2213.0 | 27.034794 | 41.085433 | 0.0 | 1.0 | 8.0 | 33.0 | 262.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 7. Amount Spent on Sweet Products (`MntSweetProducts`) - Univariate Analysis Summary

The analysis of `MntSweetProducts`, representing customer spending on sweet items over the last two years for 2,213 individuals, shows the following characteristics:

* **Central Tendency & Spread:**
    * The average amount spent on sweet products is approximately **$27.03**.
    * The median spending is substantially lower at **$8.00**, clearly indicating a strong right-skewed distribution.
    * Spending ranges from **$0 to $262.00**.
    * The standard deviation is **$41.09**, which is larger than the mean, reflecting considerable relative variability in spending.
    * The middle 50% of customers (IQR) spent between **$1 and $33** on sweet products.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line illustrate a **highly right-skewed distribution**. Just over half of all customers (1159 in the first bin, roughly 52%) spend very little ($0 to $9) on sweets.
    * The frequency of customers decreases very rapidly as spending on sweets increases, forming a long tail towards higher spending values.

* **Outliers:**
    * The box plot indicates a significant number of **outliers on the higher end of spending**. These are customers who purchase considerably more sweet products than the typical customer.

* **Key Insights:**
    * A very large proportion of the customer base spends minimally on sweet products.
    * A smaller group of customers accounts for higher expenditures in this category.
    * This pronounced right-skewness is consistent across most of the individual spending variables analyzed so far. As with the others, a transformation (e.g., log transform) might be considered for this feature before clustering to mitigate the influence of the long tail and outliers if the chosen clustering algorithm is sensitive to them.

#### 8. Amount Spent on Gold Products (`MntGoldProds`)

In [19]:
if 'df' in locals() and df is not None:
    if 'MntGoldProds' in df.columns:
        # Call the reusable function to analyze the 'MntGoldProds' column
        analyze_univariate(df, 'MntGoldProds', nbins_hist=30) 
    else:
        print("Error: 'MntGoldProds' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: MntGoldProds ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MntGoldProds | 2213.0 | 43.911432 | 51.699746 | 0.0 | 9.0 | 24.0 | 56.0 | 321.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 8. Amount Spent on Gold Products (`MntGoldProds`) - Univariate Analysis Summary

The analysis of `MntGoldProds`, representing customer spending on gold items over the last two years for 2,213 individuals, reveals the following:

* **Central Tendency & Spread:**
    * The average amount spent on gold products is approximately **$43.91**.
    * The median spending is lower at **$24.00**, indicating a right-skewed distribution, though the difference between mean and median is less extreme than for some other product categories.
    * Spending ranges from **$0 to $321.00**.
    * The standard deviation is **$51.70**, which is larger than the mean, suggesting significant variability in spending on gold.
    * The middle 50% of customers (IQR) spent between **$9 and $56** on gold products.

* **Distribution Shape (Histogram & KDE):**
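    * The histogram and KDE line clearly show a **right-skewed distribution**. A large number of customers (973 in the first bin, roughly 44%) spend relatively little on gold products. Because the 25th percentile is already $9, this bin reflects low spending in general rather than only $0.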
    * While skewed, the decline in customer frequency is somewhat more gradual after the initial peak compared to categories like fruits or sweets, before tapering off into a long tail.

* **Outliers:**
    * The box plot indicates several **outliers on the higher end of spending**, representing customers who purchase considerably more gold products than the typical customer.

* **Key Insights:**
    * A significant portion of the customer base spends little to nothing on gold products.
    * There's a segment of customers who do engage in purchasing gold, with spending extending up to a few hundred dollars.
    * The right-skewness is still a dominant feature. As with other skewed spending variables, a transformation (e.g., log transform) might be considered for this feature before clustering if the algorithm is sensitive to such distributions. The nature of gold purchases might also tie into different customer motivations (e.g., gifts, occasional luxury) compared to regular consumables.
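One rough way to compare skew across the spending categories, using only the summary statistics already reported, is Pearson's second skewness coefficient, `3 * (mean - median) / std`. The sketch below is illustrative; the helper name `pearson_median_skewness` is an assumption introduced here and is not part of the notebook's own code.

```python
def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


# (mean, median, std) copied from the descriptive statistics tables.
summary = {
    "MntFishProducts": (37.635337, 12.0, 54.763278),
    "MntFruits": (26.323995, 8.0, 39.735932),
    "MntSweetProducts": (27.034794, 8.0, 41.085433),
    "MntGoldProds": (43.911432, 24.0, 51.699746),
}

skew = {name: pearson_median_skewness(*stats) for name, stats in summary.items()}

for name, value in skew.items():
    print(f"{name:<18} {value:.2f}")
```

All four coefficients are clearly positive, and the value for `MntGoldProds` is the smallest of the group, which matches the observation that its mean and median sit closer together than in the other categories.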

#### 9. Number of Web Purchases (`NumWebPurchases`)

We now shift our focus from spending amounts to purchasing behavior, starting with the channels customers use. The `NumWebPurchases` column records the number of purchases made through the company’s website. Understanding the prevalence of online shopping among the customer base is key for digital marketing strategies and resource allocation.

In [20]:
if 'df' in locals() and df is not None:
    if 'NumWebPurchases' in df.columns:
        # Call the reusable function to analyze the 'NumWebPurchases' column
        # Since this is a count of purchases, the number of bins might not need to be as high
        # as for continuous spending, but let's start with a reasonable number.
        analyze_univariate(df, 'NumWebPurchases', nbins_hist=20) 
    else:
        print("Error: 'NumWebPurchases' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: NumWebPurchases ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| NumWebPurchases | 2213.0 | 4.087664 | 2.741664 | 0.0 | 2.0 | 4.0 | 6.0 | 27.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 9. Number of Web Purchases (`NumWebPurchases`) - Univariate Analysis Summary

The analysis of `NumWebPurchases`, representing the number of purchases made via the company's website by 2,213 customers, shows the following:

* **Central Tendency & Spread:**
    * Customers make an average of approximately **4.09 web purchases**.
    * The median number of web purchases is **4.0**, aligning closely with the mean, suggesting less skew than the monetary spending variables.
    * The number of web purchases ranges from **0 to 27**.
    * The standard deviation is about **2.74 purchases**.
    * The middle 50% of customers (IQR) made between **2 and 6 web purchases**.

* **Distribution Shape (Histogram & KDE):**
    * The histogram shows the highest concentration of customers making around **2-3 web purchases**, with the most frequent count being 2 purchases (701 customers).
    * The distribution is **right-skewed**, though not as extremely as the monetary spending columns. Many customers make a few web purchases, and the frequency decreases as the number of purchases increases.
    * The KDE line confirms a peak in the lower range of purchase counts and a tail extending towards higher counts.

* **Outliers:**
    * The box plot indicates several **outliers on the higher end**, representing customers who make a significantly larger number of web purchases (e.g., 20, 25, 27) compared to the majority.

* **Key Insights:**
    * A substantial portion of customers engages in online shopping, typically making a few purchases.
    * There's a segment of customers who are more frequent online shoppers.
    * While skewed, the distribution is more concentrated in the lower counts compared to the wide-ranging, highly skewed monetary spending variables.

#### 10. Number of Store Purchases (`NumStorePurchases`)

In [21]:
if 'df' in locals() and df is not None:
    if 'NumStorePurchases' in df.columns:
        # Call the reusable function to analyze the 'NumStorePurchases' column
        # Similar to web purchases, let's start with a moderate number of bins.
        analyze_univariate(df, 'NumStorePurchases', nbins_hist=20) 
    else:
        print("Error: 'NumStorePurchases' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: NumStorePurchases ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| NumStorePurchases | 2213.0 | 5.805242 | 3.250752 | 0.0 | 3.0 | 5.0 | 8.0 | 13.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 10. Number of Store Purchases (`NumStorePurchases`) - Univariate Analysis Summary

The analysis of `NumStorePurchases`, which reflects the number of purchases made in physical stores by 2,213 customers, reveals the following:

* **Central Tendency & Spread:**
    * Customers make an average of approximately **5.81 store purchases**.
    * The median number of store purchases is **5.0**, which is relatively close to the mean.
    * The number of store purchases ranges from **0 to 13**.
    * The standard deviation is about **3.25 purchases**, indicating a moderate spread.
    * The middle 50% of customers (IQR) made between **3 and 8 store purchases**.

* **Distribution Shape (Histogram & KDE):**
    * The histogram shows the highest concentration of customers making around **3 store purchases** (484 customers).
    * The distribution appears somewhat **multimodal** (multiple peaks) or at least not a simple smooth curve, with notable concentrations around 3, 5-6, and 8-10 purchases. The KDE line smooths this but still shows a broad central tendency rather than a sharp single peak.
    * Overall, the distribution is less skewed than the monetary spending variables or even web purchases, though it does tail off after about 10 purchases.

* **Outliers:**
    * The box plot does not indicate any extreme outliers far removed from the main distribution, suggesting the range of 0-13 purchases is fairly typical for this customer base.

* **Key Insights:**
    * In-store purchasing is a common activity, with most customers making between 3 and 8 purchases.
    * Unlike the highly skewed spending data, the number of store purchases shows a more distributed pattern across a moderate range of counts.
    * The physical store channel appears to be consistently utilized by a broad segment of the customer base.
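The presence or absence of box-plot outliers can be cross-checked from the quartiles alone. The sketch below assumes the box plot uses the conventional 1.5 × IQR rule (Tukey fences); that is a common default, but it is an assumption about the plotting setup rather than something stated in this notebook. The function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; points outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# (25th percentile, 75th percentile, min, max) from the descriptive statistics.
quartiles = {
    "NumWebPurchases": (2.0, 6.0, 0.0, 27.0),
    "NumStorePurchases": (3.0, 8.0, 0.0, 13.0),
}

has_outliers = {}
for name, (q1, q3, lowest, highest) in quartiles.items():
    lower, upper = tukey_fences(q1, q3)
    # An outlier exists if the observed extremes fall outside the fences.
    has_outliers[name] = lowest < lower or highest > upper
    print(f"{name}: fences = ({lower}, {upper}), outliers present: {has_outliers[name]}")
```

Under this rule the upper fence for store purchases lies above the observed maximum of 13, while the upper fence for web purchases lies well below the observed maximum of 27, which is consistent with what the two box plots show.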

#### 11. Number of Catalog Purchases (`NumCatalogPurchases`)

In [22]:
if 'df' in locals() and df is not None:
    if 'NumCatalogPurchases' in df.columns:
        # Call the reusable function to analyze the 'NumCatalogPurchases' column
        # Let's use a moderate number of bins, similar to other purchase count variables.
        analyze_univariate(df, 'NumCatalogPurchases', nbins_hist=20) 
    else:
        print("Error: 'NumCatalogPurchases' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: NumCatalogPurchases ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| NumCatalogPurchases | 2213.0 | 2.671487 | 2.927096 | 0.0 | 0.0 | 2.0 | 4.0 | 28.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 11. Number of Catalog Purchases (`NumCatalogPurchases`) - Univariate Analysis Summary

The analysis of `NumCatalogPurchases`, which indicates the number of purchases made via catalogs by 2,213 customers, reveals the following:

* **Central Tendency & Spread:**
    * Customers make an average of approximately **2.67 catalog purchases**.
    * The median number of catalog purchases is **2.0**. The tallest histogram bar holds **1066 customers** (almost half); since the median is 2, that bar most likely covers 0-1 purchases rather than 0 alone. The 25th percentile of 0 confirms that at least a quarter of customers made **0 catalog purchases**.
    * The number of catalog purchases ranges from **0 to 28**.
    * The standard deviation is about **2.93 purchases**.
    * The middle 50% of customers (IQR) made between **0 and 4 catalog purchases**.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line show a **highly right-skewed distribution**, with a very prominent peak at 0 purchases.
    * The frequency of customers drops off extremely sharply after 0 purchases, with a long tail representing a smaller number of customers who make more catalog purchases.

* **Outliers:**
    * The box plot indicates several **outliers on the higher end**, representing customers who make a significantly larger number of catalog purchases (e.g., above 10, and up to 28) compared to the vast majority.

* **Key Insights:**
    * A very large segment of the customer base does not use catalogs for purchases at all.
    * For those who do use catalogs, the number of purchases is generally low.
    * There's a small niche of customers who are more active catalog shoppers. This channel seems to cater to a specific, smaller subset of the customer base.

#### 12. Recency (Days Since Last Purchase)

We will now analyze the `Recency` column, which indicates the number of days since a customer's last purchase. Recency is a key metric in RFM (Recency, Frequency, Monetary Value) analysis and is often strongly correlated with customer engagement and likelihood to respond to new offers. Customers who have purchased more recently are generally considered more active.

In [23]:
if 'df' in locals() and df is not None:
    if 'Recency' in df.columns:
        # Call the reusable function to analyze the 'Recency' column
        # Recency values typically range from 0 to a certain number of days (e.g., 0-100 or more).
        analyze_univariate(df, 'Recency', nbins_hist=25) 
    else:
        print("Error: 'Recency' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: Recency ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Recency | 2213.0 | 49.007682 | 28.941864 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 12. Recency (Days Since Last Purchase) - Univariate Analysis Summary

The analysis of the `Recency` column, indicating the number of days since a customer's last purchase for 2,213 individuals, shows the following:

* **Central Tendency & Spread:**
    * The average recency is approximately **49.01 days**.
    * The median recency is **49.0 days**, which is almost identical to the mean. This suggests a very symmetrical distribution.
    * Recency values range from **0 days (most recent) to 99 days (least recent)**.
    * The standard deviation is about **28.94 days**.
    * The middle 50% of customers (IQR) have a recency between **24 and 74 days**.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line depict a relatively **uniform or flat distribution** across the range of 0 to 99 days. There isn't a strong single peak; instead, customer counts are fairly evenly spread across different recency bins.
    * There are slight variations, with the highest counts around 0-5 days (135 customers) and other minor peaks, but no dominant mode.

* **Outliers:**
    * The box plot does not indicate any outliers, as the data is quite evenly distributed within its 0-99 day range.

* **Key Insights:**
    * The customer base has a wide and fairly uniform spread in terms of how recently they made a purchase. There isn't a strong concentration of very recent or very old purchasers.
    * This uniform distribution is quite different from the skewed distributions we saw for spending and purchase count variables. It suggests that customer activity, in terms of when they last bought something, is quite varied.
    * This could be very useful for segmentation, as it might differentiate customers who are consistently engaging versus those who purchase more sporadically.
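The "roughly uniform" reading can be sanity-checked against what a perfectly uniform distribution over 0 to 99 days would produce. This is a small sketch using only the reported summary statistics; treating `Recency` as a discrete uniform variable is an idealization for comparison, not a claim about how the data was generated.

```python
import statistics

# An idealized customer base with every recency value 0..99 equally likely.
days = list(range(100))

ideal_mean = statistics.fmean(days)
ideal_std = statistics.pstdev(days)

# Values reported in the descriptive statistics for Recency.
observed_mean = 49.007682
observed_std = 28.941864

print(f"ideal:    mean = {ideal_mean:.2f}, std = {ideal_std:.2f}")
print(f"observed: mean = {observed_mean:.2f}, std = {observed_std:.2f}")
```

The observed mean and standard deviation land very close to the idealized values, which supports describing the distribution as approximately flat.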

#### 13. Number of Purchases Made with a Discount (`NumDealsPurchases`)

Next, we will analyze the `NumDealsPurchases` column. This feature counts the number of purchases made by customers when a discount was applied. Understanding how frequently customers utilize deals can help identify price-sensitive segments and inform promotional strategies.

In [24]:
if 'df' in locals() and df is not None:
    if 'NumDealsPurchases' in df.columns:
        # Call the reusable function to analyze the 'NumDealsPurchases' column
        # This is a count variable, so let's use a moderate number of bins.
        analyze_univariate(df, 'NumDealsPurchases', nbins_hist=15) 
    else:
        print("Error: 'NumDealsPurchases' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: NumDealsPurchases ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| NumDealsPurchases | 2213.0 | 2.32535 | 1.924402 | 0.0 | 1.0 | 2.0 | 3.0 | 15.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 13. Number of Purchases Made with a Discount (`NumDealsPurchases`) - Univariate Analysis Summary

The analysis of `NumDealsPurchases`, which counts the number of purchases made with a discount by 2,213 customers, reveals the following:

* **Central Tendency & Spread:**
    * Customers make an average of approximately **2.33 deal purchases**.
    * The median number of deal purchases is **2.0**.
    * The number of deal purchases ranges from **0 to 15**.
    * The standard deviation is about **1.92 purchases**.
    * The middle 50% of customers (IQR) made between **1 and 3 deal purchases**.

* **Distribution Shape (Histogram & KDE):**
    * The histogram and KDE line show a **right-skewed distribution**, with the highest concentrations at 0-1 (1001 customers) and 2-3 (786 customers) deal purchases.
    * A significant number of customers make 0 or 1 deal purchases.
    * The frequency drops off sharply after 2-3 deal purchases, with a long tail indicating fewer customers making many deal purchases.

* **Outliers:**
    * The box plot indicates several **outliers on the higher end**, representing customers who make a significantly larger number of purchases using discounts (e.g., > 6, up to 15).

* **Key Insights:**
    * Most customers make a small number of purchases with discounts (0-3).
    * There's a smaller segment of customers who are more frequent deal users.
    * This suggests varying levels of price sensitivity or engagement with promotional offers across the customer base.

#### 14. Number of Website Visits Last Month (`NumWebVisitsMonth`)

We will now examine the `NumWebVisitsMonth` column, which records the number of visits to the company's website in the last month. This metric can provide insights into customer engagement with the online platform, brand interest, and potential for future online purchases, even if not every visit results in a transaction.

In [25]:
if 'df' in locals() and df is not None:
    if 'NumWebVisitsMonth' in df.columns:
        # Call the reusable function to analyze the 'NumWebVisitsMonth' column
        # This is a count variable, likely with a relatively small range.
        analyze_univariate(df, 'NumWebVisitsMonth', nbins_hist=15) 
    else:
        print("Error: 'NumWebVisitsMonth' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Univariate Analysis for: NumWebVisitsMonth ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| NumWebVisitsMonth | 2213.0 | 5.321735 | 2.425092 | 0.0 | 3.0 | 6.0 | 7.0 | 20.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 14. Number of Website Visits Last Month (`NumWebVisitsMonth`) - Univariate Analysis Summary

The analysis of `NumWebVisitsMonth`, representing the number of website visits in the last month by 2,213 customers, indicates the following:

* **Central Tendency & Spread:**
    * Customers visit the website an average of approximately **5.32 times per month**.
    * The median number of visits is **6.0**, slightly higher than the mean, suggesting a slight left skew or a distribution concentrated in the higher single digits.
    * The number of website visits ranges from **0 to 20**.
    * The standard deviation is about **2.43 visits**.
    * The middle 50% of customers (IQR) visited the website between **3 and 7 times** in the last month.

* **Distribution Shape (Histogram & KDE):**
    * The histogram shows the highest concentration of customers visiting the website around **6-7 times a month** (the tallest bar is at 6-7 visits with 722 customers).
    * There are also significant numbers of customers visiting 4-5 times (494) and 8-9 times (422).
    * The distribution is somewhat concentrated in the 3-9 visits range, with fewer customers visiting very infrequently (0-2 times) or very frequently (more than 10 times). The KDE line confirms a peak in the 6-7 visit range.

* **Outliers:**
    * The box plot indicates a few **outliers on the higher end**, representing customers who visit the website much more frequently (e.g., 13, 15, 17, 20 times) than the majority.

* **Key Insights:**
    * Most customers actively visit the website, typically multiple times a month.
    * The engagement, in terms of visit frequency, is fairly concentrated, with a peak around 6-7 visits.
    * A smaller group of highly engaged users visits the website much more often.
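Because each histogram bar for these count variables spans two adjacent values (for example, 6-7 visits), it can help to look at exact per-value frequencies as well. The sketch below uses a small invented list of visit counts, not the dataset, to show how a bin width of 2 merges neighboring counts. In the notebook, `df['NumWebVisitsMonth'].value_counts().sort_index()` would give the exact frequencies directly.

```python
from collections import Counter

# Hypothetical monthly visit counts, invented for illustration only.
visits = [0, 1, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 9, 13]

# Exact frequency of each distinct count.
exact = Counter(visits)

# Frequencies after grouping into bins of width 2, keyed by the bin's lower edge.
bin_width = 2
binned = Counter((v // bin_width) * bin_width for v in visits)

for edge in sorted(binned):
    print(f"{edge}-{edge + bin_width - 1} visits: {binned[edge]} customers")
print(f"exactly 6 visits: {exact[6]}, exactly 7 visits: {exact[7]}")
```

The binned view reports one number for the 6-7 range, while the exact view shows how that total splits between 6 and 7 visits.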

#### 15. Family Structure: Number of Children and Teenagers

Understanding the family composition of customers is crucial, as households with children or teenagers often have distinct needs and purchasing behaviors. We will analyze two related columns:
* `Kidhome`: Number of children in the customer's household.
* `Teenhome`: Number of teenagers in the customer's household.

These are count variables, likely with a small range of discrete values (e.g., 0, 1, 2).


In [26]:
if 'df' in locals() and df is not None:
    if 'Kidhome' in df.columns:
        print("--- Analyzing Number of Children (Kidhome) ---")
        nbins = df['Kidhome'].max() + 1 if df['Kidhome'].max() >= 0 else 3
        analyze_univariate(df, 'Kidhome', nbins_hist=int(nbins)) 
    else:
        print("Error: 'Kidhome' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Analyzing Number of Children (Kidhome) ---
--- Univariate Analysis for: Kidhome ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Kidhome | 2213.0 | 0.441934 | 0.536965 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


In [27]:
if 'df' in locals() and df is not None:
    if 'Teenhome' in df.columns:
        print("\n\n--- Analyzing Number of Teenagers (Teenhome) ---")
        # Similar logic for nbins for Teenhome
        nbins = df['Teenhome'].max() + 1 if df['Teenhome'].max() >= 0 else 3
        analyze_univariate(df, 'Teenhome', nbins_hist=int(nbins)) 
    else:
        print("Error: 'Teenhome' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")



--- Analyzing Number of Teenagers (Teenhome) ---
--- Univariate Analysis for: Teenhome ---

Descriptive Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Teenhome | 2213.0 | 0.505648 | 0.544236 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |



Distribution Plot (Histogram with Scaled KDE & Marginal Box Plot):


#### 15. Family Structure: Number of Children and Teenagers - Univariate Analysis Summary

The analysis of `Kidhome` and `Teenhome` provides insights into the family composition of the 2,213 customers.

**a. Number of Children (`Kidhome`)**

* **Central Tendency & Spread:**
    * On average, customers have approximately **0.44 children** in their household.
    * The median number of children is **0**, indicating that more than half of the customers have no young children at home.
    * The number of children ranges from **0 to 2**.
    * The middle 50% of customers (IQR) have between **0 and 1 child** (25th percentile 0, 75th percentile 1).

* **Distribution Shape (Histogram & KDE):**
    * The distribution is highly **right-skewed**.
    * A majority of customers (**1281 customers**, about 58%) have **0 children**.
    * A significant number (**886 customers**) have **1 child**.
    * Very few customers (**46 customers**) have **2 children**.

* **Key Insights for `Kidhome`:**
    * A large portion of the customer base does not have young children at home.
    * Households with one child are the next most common group. Households with two young children are rare in this dataset.

**b. Number of Teenagers (`Teenhome`)**

* **Central Tendency & Spread:**
    * On average, customers have approximately **0.51 teenagers** in their household.
    * The median number of teenagers is **0**, indicating that more than half of the customers have no teenagers at home.
    * The number of teenagers ranges from **0 to 2**.
    * The middle 50% of customers (IQR) have between **0 and 1 teenager** (25th percentile 0, 75th percentile 1).

* **Distribution Shape (Histogram & KDE):**
    * The distribution is also **right-skewed**, similar to `Kidhome`.
    * The majority of customers (**1145 customers**) have **0 teenagers**.
    * A large group (**1017 customers**) has **1 teenager**.
    * A small number of customers (**51 customers**) have **2 teenagers**.

* **Key Insights for `Teenhome`:**
    * A slim majority of customers (about 52%) have no teenagers at home, and nearly all of the rest have exactly one.
    * Households with two teenagers are uncommon.

* **Overall Family Structure Insights:**
    * Most households in the customer base have at most one young child and at most one teenager.
    * Households with multiple young children or multiple teenagers are relatively rare. This information will be very useful when considering product offerings and marketing messages.
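As a quick consistency check, the category counts quoted above can be reconciled with the means in the descriptive statistics. The sketch below works directly from those counts; `summarize_counts` is a hypothetical helper introduced here for illustration.

```python
def summarize_counts(counts):
    """Return (total, shares, mean) for a {value: frequency} mapping."""
    total = sum(counts.values())
    shares = {value: freq / total for value, freq in counts.items()}
    mean = sum(value * freq for value, freq in counts.items()) / total
    return total, shares, mean


# Frequencies quoted in the Kidhome and Teenhome summaries.
kidhome_counts = {0: 1281, 1: 886, 2: 46}
teenhome_counts = {0: 1145, 1: 1017, 2: 51}

kid_total, kid_shares, kid_mean = summarize_counts(kidhome_counts)
teen_total, teen_shares, teen_mean = summarize_counts(teenhome_counts)

print(f"Kidhome:  n = {kid_total}, mean = {kid_mean:.6f}, share with 0 = {kid_shares[0]:.1%}")
print(f"Teenhome: n = {teen_total}, mean = {teen_mean:.6f}, share with 0 = {teen_shares[0]:.1%}")
```

Both sets of counts add up to 2,213 customers and reproduce the reported means, so the histogram counts and the summary table agree.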

#### 16. Education Level

We now turn to our categorical features, starting with `Education`. This column describes the highest level of education attained by each customer. Education level can be a strong indicator of socioeconomic status, lifestyle, and consumer preferences, making it a valuable feature for segmentation.

In [28]:
if 'df' in locals() and df is not None:
    if 'Education' in df.columns:
        print("--- Analyzing Education Level ---")
        # Our analyze_univariate function will detect this as categorical
        # and produce value counts and a bar chart.
        analyze_univariate(df, 'Education') 
    else:
        print("Error: 'Education' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Analyzing Education Level ---
--- Univariate Analysis for: Education ---

Value Counts (Frequency of Categories):


| Education | Count |
|---|---|
| Graduation | 1116 |
| PhD | 480 |
| Master | 365 |
| 2n Cycle | 198 |
| Basic | 54 |



Frequency Plot (Bar Chart):


#### 16. Education Level - Univariate Analysis Summary

The analysis of the `Education` column for 2,213 customers provides insights into their educational backgrounds:

* **Frequency Distribution:**
    * **Graduation:** This is the most common education level, with **1,116 customers** (approximately 50.4%).
    * **PhD:** The second largest group consists of customers with a PhD, numbering **480** (approximately 21.7%).
    * **Master:** Customers with a Master's degree make up the third group, with **365 individuals** (approximately 16.5%).
    * **2n Cycle:** This category (most likely the "second cycle" of the European Bologna framework, i.e., master's-level study) includes **198 customers** (approximately 8.9%).
    * **Basic:** A smaller group of **54 customers** (approximately 2.4%) has a Basic level of education.

* **Visualization (Bar Chart):**
    * The bar chart visually confirms that "Graduation" is the predominant educational level.
    * There are substantial numbers of customers with advanced degrees (PhD and Master).
    * "Basic" education is the least common.

* **Key Insights:**
    * The customer base is generally well-educated: about 89% of customers report a Graduation, Master, or PhD level.
    * A significant portion (over 38%) holds postgraduate qualifications (PhD or Master).
    * This high level of education might correlate with income levels, lifestyle choices, and product preferences, making it an important feature for segmentation.
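
The percentages in this summary follow directly from the value counts. As a small sketch, the Series below is typed in from the printed output rather than computed from `df`, and its name is an assumption made for illustration:

```python
import pandas as pd

# Counts copied from the value-counts output above
education_counts = pd.Series(
    {"Graduation": 1116, "PhD": 480, "Master": 365, "2n Cycle": 198, "Basic": 54},
    name="Count",
)

total = education_counts.sum()

# Share of each category, in percent, rounded to one decimal place
education_pct = (education_counts / total * 100).round(1)

# Combined share of the two categories the summary treats as postgraduate
postgrad_pct = round(float(education_counts[["PhD", "Master"]].sum() / total * 100), 1)

print(education_pct)
print(f"PhD + Master: {postgrad_pct}%")
```

On the real DataFrame, `df['Education'].value_counts(normalize=True)` would give the same shares without retyping the counts.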

#### 17. Marital Status

Finally, for our categorical features, we'll examine `Marital_Status`. This column describes the customer's current marital status, which can often influence household size, spending priorities, and lifestyle, making it a relevant factor for segmentation.

In [29]:
if 'df' in locals() and df is not None:
    if 'Marital_Status' in df.columns:
        print("--- Analyzing Marital Status ---")
        # Our analyze_univariate function will detect this as categorical
        # and produce value counts and a bar chart.
        analyze_univariate(df, 'Marital_Status') 
    else:
        print("Error: 'Marital_Status' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Analyzing Marital Status ---
--- Univariate Analysis for: Marital_Status ---

Value Counts (Frequency of Categories):


| Marital_Status | Count |
|---|---|
| Married | 857 |
| Together | 572 |
| Single | 470 |
| Divorced | 231 |
| Widow | 76 |
| Alone | 3 |
| Absurd | 2 |
| YOLO | 2 |



Frequency Plot (Bar Chart):


#### 17. Marital Status - Univariate Analysis Summary

The analysis of the `Marital_Status` column for 2,213 customers reveals the distribution of their relationship statuses:

* **Frequency Distribution:**
    * **Married:** This is the largest group, with **857 customers** (approximately 38.7%).
    * **Together:** The second most common status is "Together" (often indicating a committed relationship without formal marriage), with **572 customers** (approximately 25.8%).
    * **Single:** There are **470 customers** who identify as "Single" (approximately 21.2%).
    * **Divorced:** A smaller group of **231 customers** are "Divorced" (approximately 10.4%).
    * **Widow:** There are **76 customers** who are "Widow(er)s" (approximately 3.4%).
    * **Very Small Categories:**
        * **Alone:** 3 customers (approximately 0.1%)
        * **Absurd:** 2 customers (approximately 0.1%)
        * **YOLO:** 2 customers (approximately 0.1%)

* **Visualization (Bar Chart):**
    * The bar chart visually confirms that "Married" and "Together" are the most prevalent marital statuses.
    * "Single" and "Divorced" also represent significant portions of the customer base.
    * The categories "Widow," "Alone," "Absurd," and "YOLO" are very small in comparison.

* **Key Insights:**
    * A majority of the customer base is in a partnered status ("Married" or "Together"), accounting for nearly 65% of customers.
    * A significant portion are "Single" or "Divorced."
    * The categories "Alone," "Absurd," and "YOLO" are extremely rare (7 customers in total) and might represent data peculiarities or very niche self-identifications. For clustering purposes, they may need to be grouped into an "Other" category or otherwise handled carefully. Depending on the encoding strategy used later, leaving them as-is could produce tiny, uninterpretable clusters or cause them to be treated as noise.
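
One way to handle such rare categories is to fold them into a single "Other" label before encoding. The sketch below is only an illustration: `group_rare_categories` is a hypothetical helper (not part of the notebook), the threshold of 10 is an arbitrary assumption, and the column is rebuilt from the counts above instead of being read from `df`.

```python
import pandas as pd

def group_rare_categories(series, min_count=10, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Toy column rebuilt from the reported counts, for illustration only
status_counts = {"Married": 857, "Together": 572, "Single": 470, "Divorced": 231,
                 "Widow": 76, "Alone": 3, "Absurd": 2, "YOLO": 2}
marital = pd.Series(
    [status for status, n in status_counts.items() for _ in range(n)],
    name="Marital_Status",
)

marital_grouped = group_rare_categories(marital, min_count=10)
print(marital_grouped.value_counts())
```

With this threshold the seven rare rows collapse into one "Other" level. Whether "Alone" should instead be merged into "Single" is a modeling choice that this sketch does not settle.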

### **b. Bivariate Analysis**

After examining individual features, the next step in our EDA is Bivariate Analysis. Here, we investigate the relationships and interactions between pairs of variables. Understanding these relationships can help us identify potential correlations, dependencies, and patterns that might be useful for segmentation.

We will explore relationships such as:

* **Numerical vs. Numerical:**
    * How does `Income` relate to spending on various product categories (e.g., `MntWines`, `MntMeatProducts`)?
    * Is there a relationship between `Age` and `Recency` or spending habits?
    * We'll use scatter plots and correlation matrices for this.

#### 1. Correlation Heatmap of Numerical Features

To begin our bivariate analysis, we'll generate a correlation heatmap for all the numerical features in our dataset. This heatmap will visually represent the Pearson correlation coefficients between each pair of numerical variables.

* **Pearson Correlation Coefficient:** This measures the linear relationship between two variables. It ranges from -1 to +1:
    * **+1:** Perfect positive linear correlation (as one variable increases, the other increases proportionally).
    * **-1:** Perfect negative linear correlation (as one variable increases, the other decreases proportionally).
    * **0:** No linear correlation.
* **Heatmap Interpretation:**
    * Colors will indicate the strength and direction of the correlation (e.g., darker shades for stronger correlations, different colors for positive and negative).
    * We will also display the correlation values on the map.
    * We are particularly interested in identifying strong positive or negative correlations, as these suggest relationships worth exploring further. For instance, we might expect `Income` to correlate positively with spending on certain product categories.

This overview will guide our subsequent choices for more detailed bivariate explorations using scatter plots.
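
To make the coefficient concrete: Pearson's r is the sum of products of the centered values divided by the square root of the product of their sums of squares. The sketch below computes it by hand and compares it with `np.corrcoef`; the helper name `pearson_r` and the five data points are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from centered values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Made-up values, not taken from the dataset
income = [20_000, 35_000, 50_000, 65_000, 80_000]
wine_spend = [10, 40, 150, 300, 320]

r_manual = pearson_r(income, wine_spend)
r_numpy = np.corrcoef(income, wine_spend)[0, 1]
print(round(r_manual, 3), round(r_numpy, 3))
```

The two values agree up to floating-point error, and perfectly proportional inputs give exactly +1 or -1.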

In [30]:
if 'df' in locals() and df is not None:
    print("--- Generating Correlation Heatmap (Lower Triangle) ---")

    numerical_df = df.select_dtypes(include=np.number)

    if numerical_df.empty:
        print("Error: No numerical columns found in the DataFrame to calculate correlations.")
    else:
        print(f"Numerical columns selected for correlation: {numerical_df.columns.tolist()}")

        corr_matrix = numerical_df.corr()

        mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1) 
        
        corr_matrix_masked = corr_matrix.mask(mask)

        fig = px.imshow(corr_matrix_masked, # Use the masked matrix
                        text_auto='.2f',  # Display values, formatted to 2 decimal places
                        aspect="auto",
                        color_continuous_scale='RdBu_r',
                        zmin=-1, zmax=1,
                        title="Correlation Heatmap of Numerical Features (Lower Triangle)"
                       )
        
        fig.update_traces(textfont_size=10, # Adjust font size as needed (e.g., 8, 10, 12)
                         ) 

        num_features = len(corr_matrix.columns)
        plot_height = max(600, num_features * 35) 
        plot_width = max(700, num_features * 40)  

        fig.update_layout(
            height=plot_height,
            width=plot_width,
            xaxis_tickangle=-60,
            yaxis_tickangle=0,
            title_x=0.5,
            font_family="Arial, sans-serif",
            plot_bgcolor='white',
            paper_bgcolor='white',
            margin=dict(l=120, r=50, t=100, b=120) # Increased margins for labels
        )
        
        fig.update_xaxes(side="bottom", 
                         tickmode='array', 
                         tickvals=list(range(len(corr_matrix.columns))), 
                         ticktext=corr_matrix.columns,
                         automargin=True)
        fig.update_yaxes(tickmode='array', 
                         tickvals=list(range(len(corr_matrix.index))), 
                         ticktext=corr_matrix.index,
                         autorange="reversed", # Important for imshow with manual ticks
                         automargin=True)

        fig.show()
        
        print("\nInterpretation Guide:")
        print("- Values close to +1 (e.g., dark red) indicate a strong positive linear correlation.")
        print("- Values close to -1 (e.g., dark blue) indicate a strong negative linear correlation.")
        print("- Values close to 0 (e.g., white/light colors) indicate a weak or no linear correlation.")

else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Generating Correlation Heatmap (Lower Triangle) ---
Numerical columns selected for correlation: ['Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response', 'Age']



Interpretation Guide:
- Values close to +1 (e.g., dark red) indicate a strong positive linear correlation.
- Values close to -1 (e.g., dark blue) indicate a strong negative linear correlation.
- Values close to 0 (e.g., white/light colors) indicate a weak or no linear correlation.


#### 1. Correlation Heatmap - Key Observations and Interpretations

The correlation heatmap reveals several important linear relationships between the numerical features in our dataset:

* **Income and Spending:**
    * `Income` shows **moderate to strong positive correlations** with spending in every product category: strongest for `MntWines` (0.58) and `MntMeatProducts` (0.58), moderate for `MntFishProducts` (0.44), `MntSweetProducts` (0.44), and `MntFruits` (0.43), and weakest for `MntGoldProds` (0.33). This is expected: higher income is generally associated with higher spending across most product categories, especially discretionary ones like wine and meat.
    * `Income` also has a **moderate positive correlation** with `NumStorePurchases` (0.53) and `NumCatalogPurchases` (0.50), and a weaker one with `NumWebPurchases` (0.39). This suggests higher-income customers tend to make more purchases, particularly in stores and through catalogs.

* **Inter-Spending Correlations (Product Categories):**
    * There are **strong positive correlations among most spending categories**, particularly:
        * `MntWines` and `MntMeatProducts` (0.67).
        * `MntFruits` with `MntMeatProducts` (0.65), `MntFishProducts` (0.59), and `MntSweetProducts` (0.57).
        * `MntMeatProducts` with `MntFishProducts` (0.57) and `MntSweetProducts` (0.53).
    * This indicates that customers who spend more on one type of product (especially food/drink) tend to spend more on others as well, suggesting an overall higher "basket value" or general propensity to spend on these items.

* **Purchase Channels and Spending:**
    * `NumCatalogPurchases` shows a **strong positive correlation** with `MntWines` (0.63) and `MntMeatProducts` (0.73), and moderate correlations with other spending categories. This is a very strong link, suggesting catalog shoppers are high spenders, especially on premium items.
    * `NumStorePurchases` also correlates positively with spending, notably with `MntWines` (0.64).
    * `NumWebPurchases` shows moderate positive correlations with spending categories like `MntWines` (0.55).

* **Family Structure (`Kidhome`, `Teenhome`):**
    * `Kidhome` (number of young children) has **moderate to strong negative correlations** with `Income` (-0.43) and spending on most product categories (e.g., `MntWines`: -0.50, `MntMeatProducts`: -0.44, `MntFruits`: -0.37, `MntFishProducts`: -0.39, `MntSweetProducts`: -0.38, `MntGoldProds`: -0.36). This suggests that households with more young children tend to have lower incomes (in this dataset) and spend less on these items, possibly due to budget constraints or different priorities.
    * `Kidhome` also correlates negatively with `NumCatalogPurchases` (-0.50), `NumStorePurchases` (-0.50), and `NumWebPurchases` (-0.37).
    * `Teenhome` shows weaker, mostly negative correlations with spending (e.g., `MntMeatProducts`: -0.26) and slight positive correlations with `NumWebPurchases` (0.16) and `NumWebVisitsMonth` (0.13). The negative correlation with `MntMeatProducts` is worth revisiting in later analysis.

* **Website Visits and Spending/Purchases:**
    * `NumWebVisitsMonth` has a **strong negative correlation** with `Income` (-0.56). This is counterintuitive if one expects higher income individuals to be more digitally savvy. It might suggest that lower-income individuals browse more (perhaps looking for deals) or that higher-income individuals make more decisive purchases with fewer visits, or use other channels more.
    * It also negatively correlates with most spending categories (e.g., `MntWines`: -0.32, `MntMeatProducts`: -0.54) and `NumStorePurchases` (-0.43). This is a very interesting pattern to explore – more web visits associated with lower overall spending and fewer store purchases.

* **Campaign Acceptance:**
    * `Response` (acceptance of the last campaign) shows weak to moderate positive correlations with `AcceptedCmp5` (0.32), `AcceptedCmp1` (0.30), `AcceptedCmp3` (0.25), and `AcceptedCmp4` (0.18), suggesting customers who accepted previous campaigns were more likely to accept the last one.
    * `MntWines` has moderate positive correlations with acceptances for campaigns 1, 4, 5, and the final `Response` (around 0.25-0.47). Wine buyers seem more responsive to campaigns.
    * `Income` also shows some positive correlation with accepting campaigns, particularly `AcceptedCmp5` (0.34) and `AcceptedCmp1` (0.28).

* **`Age`:**
    * `Age` has a moderate positive correlation with `Teenhome` (0.36), which is logical.
    * Most other correlations with `Age` are weak, suggesting age alone isn't a strong linear driver of most spending or purchase behaviors in this dataset, though it might have non-linear relationships or interact with other variables.

* **`Recency`:**
    * `Recency` shows very weak correlations with most other variables, except for a slight negative correlation with `Response` (-0.20), meaning customers who purchased more recently were slightly more likely to respond to the last campaign.

**Key Takeaways for Further Bivariate Exploration:**

* The relationship between `Income` and various spending/purchase habits.
* The strong link between `NumCatalogPurchases` and high spending.
* The impact of `Kidhome` on income and spending.
* The surprising negative correlation between `NumWebVisitsMonth` and `Income`/spending.
* The characteristics of customers who respond to campaigns (e.g., wine buyers, higher income).
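
Reading the strongest relationships off a large heatmap by eye is error-prone, so it can help to rank the pairs programmatically. The following is a rough sketch under stated assumptions: `top_correlated_pairs` is a hypothetical helper, and the demo DataFrame is synthetic data that only borrows column names from the dataset.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include=np.number).corr()
    # Indices of the strict lower triangle, so each pair appears exactly once
    rows, cols = np.tril_indices(len(corr), k=-1)
    pairs = pd.DataFrame({
        "feature_a": corr.index[rows],
        "feature_b": corr.columns[cols],
        "r": corr.to_numpy()[rows, cols],
    })
    pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
    return pairs.head(n).reset_index(drop=True)

# Synthetic demo data (not the real customer data)
rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, 500)
demo = pd.DataFrame({
    "Income": income,
    "MntWines": income * 0.01 + rng.normal(0, 100, 500),
    "NumWebVisitsMonth": 10 - income / 10_000 + rng.normal(0, 1.5, 500),
    "Recency": rng.integers(0, 100, 500),
})

result = top_correlated_pairs(demo, n=3)
print(result)
```

Sorting by absolute value keeps strong negative relationships, such as the one between web visits and income, near the top alongside the positive ones.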

#### 2. Reusable Function for Bivariate Scatter Plots (Numerical vs. Numerical)

To systematically explore the relationships between pairs of numerical variables identified from our correlation heatmap (or other hypotheses), we will define a reusable function. This function will generate an interactive scatter plot.

* **Purpose:** Scatter plots help us visualize the relationship between two numerical variables. Each point on the plot represents an observation (a customer, in our case), plotted according to its values for the two selected variables.
* **Functionality:**
    * Takes the DataFrame and the names of two numerical columns (one for the x-axis, one for the y-axis) as input.
    * Creates an interactive scatter plot using Plotly Express.
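    * Includes an option to add a trendline, either Ordinary Least Squares (OLS) regression or LOWESS, to help visualize the trend, if any.
    * Allows for a custom plot title, optional color encoding, log-scaled axes, and adjustable marker opacity and size; marginal box plots are drawn for both axes.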
* **Benefits:** This approach ensures consistency in our plots, reduces code duplication, and makes it easy to quickly generate visualizations for different pairs of variables.
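
For context on the OLS option: an OLS trendline is the straight line that minimizes the sum of squared vertical distances to the points. Plotly Express relies on the `statsmodels` package for its `'ols'` and `'lowess'` trendlines, so that package needs to be installed for those options to work. The sketch below shows the same kind of straight-line fit using `numpy.polyfit` on made-up data; all names and numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 100, 200)                 # hypothetical predictor
y = 3.0 * x + 20.0 + rng.normal(0, 10, 200)  # linear signal plus noise

# Least-squares straight line: returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R^2: share of the variance in y explained by the fitted line
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

For a simple one-predictor fit like this, R^2 equals the square of the Pearson correlation between `x` and `y`, which ties the trendline back to the heatmap values.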

In [31]:
def plot_numerical_bivariate(df, x_col, y_col, 
                             trendline_type=None, title=None, 
                             color_col=None, hover_name_col=None,
                             log_x=False, log_y=False,
                             opacity_val=0.7, marker_size_val=6):
    """
    Generates an interactive scatter plot for two numerical variables,
    with optional trendline, color encoding, and log scaling.
    Uses navy blue for default points and orange for marginals/trendline.

    Args:
        df (pd.DataFrame): The input DataFrame.
        x_col (str): The name of the column for the x-axis.
        y_col (str): The name of the column for the y-axis.
        trendline_type (str, optional): 'ols', 'lowess'. Defaults to None.
        title (str, optional): Custom title for the plot.
        color_col (str, optional): Column name for color-encoding points.
        hover_name_col (str, optional): Column for bold hover tooltip name.
        log_x (bool, optional): If True, applies a log scale to the x-axis. Defaults to False.
        log_y (bool, optional): If True, applies a log scale to the y-axis. Defaults to False.
        opacity_val (float, optional): Opacity of scatter markers. Defaults to 0.7.
        marker_size_val (int, optional): Size of scatter markers. Defaults to 6.
    """
    if x_col not in df.columns or y_col not in df.columns:
        print(f"Error: One or both columns ('{x_col}', '{y_col}') not found in DataFrame.")
        return

    if not pd.api.types.is_numeric_dtype(df[x_col]) or not pd.api.types.is_numeric_dtype(df[y_col]):
        print(f"Error: Both '{x_col}' and '{y_col}' must be numerical columns for a scatter plot.")
        return

    plot_df = df.copy()

    navy_blue = '#0B0056'
    orange_color = '#F86302'

    print(f"--- Bivariate Scatter Plot: {y_col} vs. {x_col} ---")
    if log_x: print(f"Applying log scale to x-axis ({x_col})")
    if log_y: print(f"Applying log scale to y-axis ({y_col})")

    if title is None:
        title = f"Relationship between {x_col} and {y_col}"
        if color_col:
            title += f" (Colored by {color_col})"
        if log_x or log_y:
            title += " (Log Scale"
            if log_x and log_y: title += " X & Y"
            elif log_x: title += " X-axis"
            elif log_y: title += " Y-axis"
            title += ")"
    
    # With no color column, force a single navy series; otherwise leave the
    # choice of colors to Plotly Express.
    color_discrete_sequence_setting = None
    if color_col is None:
        color_discrete_sequence_setting = [navy_blue]


    fig = px.scatter(plot_df,
                     x=x_col,
                     y=y_col,
                     trendline=trendline_type,
                     trendline_scope="overall",
                     color=color_col, # If None, color_discrete_sequence will apply
                     color_discrete_sequence=color_discrete_sequence_setting, # Sets default point color
                     hover_name=hover_name_col,
                     title=title,
                     labels={x_col: x_col.replace('_', ' '), y_col: y_col.replace('_', ' ')},
                     opacity=opacity_val,
                     marginal_y="box",
                     marginal_x="box",
                     log_x=log_x,
                     log_y=log_y
                    )

    # Color the marginal box plots orange. When no color column is given, make
    # sure the points are navy (px may emit 'scattergl' traces for larger data).
    for trace in fig.data:
        if trace.type == 'box':
            trace.marker.color = orange_color
        elif trace.type in ('scatter', 'scattergl') and color_col is None:
            trace.marker.color = navy_blue


    if trendline_type:
        # Heuristic: the trendline is a line-mode scatter trace whose name
        # contains "trendline". Checking the trace type first avoids reading
        # `mode` on box traces, which do not have that attribute.
        for trace in fig.data:
            is_line = trace.type in ('scatter', 'scattergl') and trace.mode == 'lines'
            if is_line and trace.name and 'trendline' in trace.name.lower():
                trace.line.color = orange_color
                trace.line.width = 2


    fig.update_layout(
        height=650,
        title_x=0.5,
        font_family="Arial, sans-serif",
        plot_bgcolor='white',
        paper_bgcolor='white',
        xaxis=dict(showgrid=False),
        yaxis=dict(showgrid=False)
    )
    
    fig.update_traces(marker=dict(size=marker_size_val,
                                  line=dict(width=0.5,
                                            color='DarkSlateGrey')), # Marker border for all scatter points
                      selector=dict(mode='markers'))

    fig.show()

#### 1. Income vs. Amount Spent on Wine (`MntWines`)

The correlation heatmap indicated a strong positive linear relationship (correlation coefficient of approximately 0.58) between `Income` and `MntWines`. We will now visualize this relationship using a scatter plot to:

* Confirm the positive trend (i.e., as income increases, spending on wine tends to increase).
* Observe the spread of data points and identify any potential patterns or outliers.
* Assess if the relationship appears strictly linear or if there might be other nuances.

In [32]:
if 'df' in locals() and df is not None:
    if 'Income' in df.columns and 'MntWines' in df.columns:
        print("--- Visualizing Income vs. MntWines ---")
        plot_numerical_bivariate(df, 
                                 x_col='Income', 
                                 y_col='MntWines', 
                                 trendline_type=None, # No trendline
                                 log_x=False,         # Original x-axis scale
                                 log_y=False,         # Original y-axis scale
                                 title='Income vs. Wine Spending',
                                 opacity_val=0.6,     # Adjust opacity as desired
                                 marker_size_val=5)    # Adjust marker size as desired
    else:
        print("Error: 'Income' or 'MntWines' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Visualizing Income vs. MntWines ---
--- Bivariate Scatter Plot: MntWines vs. Income ---


#### 1. Income vs. Amount Spent on Wine (`MntWines`) - Scatter Plot Observations

The scatter plot visualizing the relationship between `Income` and `MntWines` on their original scales reveals:

* **Concentration at Lower Values:** A large majority of customers are clustered in the lower-left portion of the plot, indicating lower incomes (e.g., mostly below $100k-$150k) and lower spending on wine (e.g., mostly below $400-$600).
* **Positive General Trend (with Variance):** While not strictly linear, there's a visible tendency for wine spending (`MntWines`) to increase as `Income` increases. However, there's considerable variance:
    * At lower to moderate income levels, wine spending varies widely from very low to moderately high.
    * At higher income levels, there's a potential for very high wine spending, but also instances of relatively low wine spending.
* **Influence of Outliers:**
    * The few customers with extremely high incomes (especially the one near $666k) significantly stretch the x-axis, compressing the view for the bulk of the data points. This particular high-income outlier does not appear to be a proportionally high wine spender.
    * Similarly, a few customers exhibit very high wine spending relative to the majority, even at moderate income levels.
* **Shape of Relationship:** The relationship doesn't appear to be perfectly linear across the entire range, partly due to the skewness in both variables and the presence of outliers. The density of points thins out considerably at higher income and higher wine spending levels.
* **Marginal Distributions:** The marginal box plots confirm the right-skewness previously observed for both `Income` and `MntWines` during univariate analysis.

**Key Takeaway:** Higher income allows for higher wine spending, but the relationship isn't uniform. Many customers, regardless of income up to a certain point, spend modestly on wine, while a segment of higher-income individuals shows a greater propensity for higher wine expenditure. The plot also highlights how outliers can affect the visual interpretation on original scales.
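
The effect of a single extreme point is easy to demonstrate numerically. The sketch below uses synthetic data only: the income and wine values are randomly generated, and the appended point merely mimics the kind of high-income, low-spending outlier described above.

```python
import numpy as np

rng = np.random.default_rng(7)
income = rng.normal(50_000, 20_000, 1_000).clip(5_000, None)
wine = (income * 0.008 + rng.normal(0, 120, 1_000)).clip(0, None)

r_clean = np.corrcoef(income, wine)[0, 1]

# Append one extreme-income customer with unremarkable wine spending
income_out = np.append(income, 666_666)
wine_out = np.append(wine, 10)
r_with_outlier = np.corrcoef(income_out, wine_out)[0, 1]

# One common remedy when plotting: cap the axis at a high percentile
income_cap = np.percentile(income_out, 99)

print(f"r without outlier: {r_clean:.2f}")
print(f"r with outlier:    {r_with_outlier:.2f}")
print(f"99th-percentile income cap: {income_cap:,.0f}")
```

A single point far from the bulk of the data pulls the correlation down noticeably here. Capping the plotted x-axis range at a high percentile, or using a log scale, keeps such a point from compressing the view of everyone else.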

#### 2. Income vs. Amount Spent on Meat Products (`MntMeatProducts`)

Following our analysis of wine spending, we'll now visualize the relationship between `Income` and `MntMeatProducts`. The correlation heatmap indicated a strong positive linear relationship (correlation coefficient of approximately 0.58). This scatter plot will help us:

* Observe the trend: Does meat spending increase with income?
* Identify the spread of data points and any distinct patterns or clusters.
* Assess the linearity of the relationship and the influence of any outliers.

We will view this on original scales without a trendline first, to see the raw data distribution.

In [33]:
if 'df' in locals() and df is not None:
    if 'Income' in df.columns and 'MntMeatProducts' in df.columns:
        print("--- Visualizing Income vs. MntMeatProducts (Original Scale, No Trendline) ---")
        
        # Call the reusable function defined above. By default (when 'color_col'
        # is not specified), 'plot_numerical_bivariate' draws navy blue points
        # and orange marginal box plots.
        plot_numerical_bivariate(df, 
                                 x_col='Income', 
                                 y_col='MntMeatProducts', 
                                 trendline_type=None, # No trendline
                                 log_x=False,         # Original x-axis scale
                                 log_y=False,         # Original y-axis scale
                                 title='Income vs. Meat Product Spending (Original Scale)',
                                 opacity_val=0.6,     # Adjust opacity as desired
                                 marker_size_val=5)    # Adjust marker size as desired
    else:
        print("Error: 'Income' or 'MntMeatProducts' column not found in the DataFrame.")
        print("Please ensure the DataFrame is loaded and preprocessed correctly.")
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Visualizing Income vs. MntMeatProducts (Original Scale, No Trendline) ---
--- Bivariate Scatter Plot: MntMeatProducts vs. Income ---


#### 2. Income vs. Amount Spent on Meat Products (`MntMeatProducts`) - Scatter Plot Observations (Original Scale)

The scatter plot visualizing the relationship between `Income` and `MntMeatProducts` on their original scales reveals:

* **Concentration at Lower Values:** The majority of customers are clustered in the lower-left area of the plot, indicating lower incomes (primarily below $100k-$150k) and correspondingly lower spending on meat products (mostly below $400-$600).
* **General Positive Trend:** There is a clear tendency for spending on meat products (`MntMeatProducts`) to increase as `Income` increases. This aligns with the positive correlation (0.58) observed in the heatmap.
* **Significant Variance:** While a positive trend exists, there's substantial variability in meat spending at any given income level, especially for moderate to higher incomes. Some individuals with higher incomes spend a lot on meat, while others spend relatively little.
* **Influence of Outliers:**
    * The extreme income outlier (around $666k) is present and, similar to wine spending, does not exhibit proportionally extreme spending on meat.
    * There are also customers who are outliers in terms of meat spending (high `MntMeatProducts`) even at more moderate income levels.
* **Shape of Relationship:** The relationship fans out as income increases, suggesting that while higher income enables higher meat spending, it doesn't guarantee it. The densest cluster of points is where both income and meat spending are relatively low.
* **Marginal Distributions:** The marginal box plots reiterate the right-skewness previously identified for both `Income` and `MntMeatProducts` during univariate analysis.

**Key Takeaway:** Higher income is associated with higher spending on meat products, but the relationship is not strictly linear and has considerable variance. Many customers, particularly at lower to mid-income levels, spend modestly on meat. The plot highlights that while income is a factor, other preferences or factors likely influence meat expenditure.
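
The "fanning out" pattern can be quantified by comparing the spread of spending across income bands. This is a sketch on synthetic data in which the noise is deliberately made to grow with income; the column names mirror the dataset, but the values and the quartile labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
income = rng.uniform(10_000, 100_000, 2_000)
# Noise whose standard deviation grows with income (synthetic fan shape)
meat = (income * 0.004 + rng.normal(0, income * 0.002)).clip(0, None)
demo = pd.DataFrame({"Income": income, "MntMeatProducts": meat})

# Split customers into four equally sized income bands
demo["income_quartile"] = pd.qcut(demo["Income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean and standard deviation of spending within each band
spread = (
    demo.groupby("income_quartile", observed=True)["MntMeatProducts"]
    .agg(["mean", "std"])
)
print(spread.round(1))
```

A standard deviation that rises from the lowest to the highest income band is the numerical counterpart of a scatter plot that widens to the right.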

#### 3. Income vs. Various Spending Categories (Fruits, Fish, Sweets, Gold)

To efficiently compare how `Income` relates to spending on the remaining product categories (`MntFruits`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`), we will generate a series of scatter plots in a single figure. Each plot will show `Income` on the x-axis and the amount spent on one of these specific categories on the y-axis.

This approach allows for:
* **Direct Comparison:** Easily see if the nature of the relationship with `Income` (e.g., strength, spread, presence of outliers) differs across these less dominant spending categories.
* **Efficiency:** Consolidates multiple related visualizations.

We will view these on their original scales and without trendlines to observe the raw data patterns. Marginal box plots will be included for each axis.
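
Faceting by category requires the data in long format, with one column naming the spending category and another holding the amount. As a minimal sketch of that reshaping step with `DataFrame.melt`, using three made-up customers rather than the real data:

```python
import pandas as pd

# Three made-up customers in wide format (one column per spending category)
wide = pd.DataFrame({
    "Income": [30_000, 60_000, 90_000],
    "MntFruits": [5, 20, 60],
    "MntFishProducts": [8, 35, 90],
})

# Long format: one row per (customer, category) pair
long_df = wide.melt(
    id_vars="Income",
    value_vars=["MntFruits", "MntFishProducts"],
    var_name="SpendingCategory",
    value_name="AmountSpent",
)
print(long_df)
```

Each customer now appears once per spending category, which lets a single plotting call split the data into one panel per value of `SpendingCategory`.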

In [34]:
if 'df' in locals() and df is not None:
    spending_categories_to_plot = ['MntFruits', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
    
    required_cols = ['Income'] + spending_categories_to_plot
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        print(f"Error: The following required columns are missing: {missing_cols}")
    else:
        print(f"--- Visualizing Income vs. Spending on: {', '.join(spending_categories_to_plot)} ---")

        if '__temp_id_for_melt__' not in df.columns:
            df_melt = df.reset_index().rename(columns={'index': '__temp_id_for_melt__'})
        else:
            df_melt = df.copy()

        df_long = pd.melt(df_melt, 
                          id_vars=['__temp_id_for_melt__', 'Income'], 
                          value_vars=spending_categories_to_plot,
                          var_name='SpendingCategory', 
                          value_name='AmountSpent')

        navy_blue = '#0B0056'
        orange_color = '#F86302'

        fig = px.scatter(df_long,
                         x='Income',
                         y='AmountSpent',
                         facet_row='SpendingCategory', # Creates a row of plots for each category
                         title='Income vs. Spending on Various Product Categories (Original Scale)',
                         labels={'AmountSpent': 'Amount Spent', 'Income': 'Income'},
                         opacity=0.6,
                         marginal_x="box", # Marginal box plot for Income (common for all facets)
                         marginal_y="box", # Marginal box plot for AmountSpent (specific to each facet)
                         height=250 * len(spending_categories_to_plot) # Adjust height based on number of facets
                        )

        fig.update_traces(marker=dict(color=navy_blue, size=5, line=dict(width=0.5, color='DarkSlateGrey')),
                          selector=dict(mode='markers'))
       
        for trace_data in fig.data:
            if trace_data.type == 'box':
                trace_data.marker.color = orange_color
        
        fig.update_layout(
            title_x=0.5,
            font_family="Arial, sans-serif",
            plot_bgcolor='white',
            paper_bgcolor='white'
        )
        
        fig.update_xaxes(showgrid=False)
        fig.update_yaxes(showgrid=False)

        fig.show()


else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Visualizing Income vs. Spending on: MntFruits, MntFishProducts, MntSweetProducts, MntGoldProds ---


#### 3. Income vs. Various Spending Categories (Fruits, Fish, Sweets, Gold) - Scatter Plot Observations

The faceted scatter plots visualize the relationship between `Income` and spending on `MntFruits`, `MntFishProducts`, `MntSweetProducts`, and `MntGoldProds` on their original scales.

* **Common Pattern of Concentration:**
    * For all four product categories, the vast majority of customers are clustered in the lower-left portion of each plot. This indicates that most customers have lower to moderate incomes (primarily < $100k-$150k) and correspondingly low spending on these specific items (generally well below $50-$100 for most, with many near $0).

* **General Positive Trend (with High Variance):**
    * In each category, there's a visible positive trend: as `Income` increases, the potential for higher spending on these items also increases.
    * However, this relationship is characterized by high variance. Even at higher income levels, spending on these specific categories can range from very low to moderately high. Not all high-income individuals are high spenders in these particular categories.

* **Influence of Outliers:**
    * The extreme `Income` outlier (around $666k) is present in all plots and consistently shows low to moderate spending in these specific categories, not proportional to their income.
    * Each spending category (`MntFruits`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`) also has its own spending outliers (customers spending significantly more than average in that category), even at more modest income levels.

* **Relative Spending Levels:**
    * Visually, the maximum spending levels appear highest for `MntGoldProds` (up to roughly $300, per the earlier univariate analysis), while the maximums for `MntFruits`, `MntFishProducts`, and `MntSweetProducts` are lower (roughly $200-$260).
    * However, for the bulk of the customers, spending on all these four categories remains relatively low.

* **Marginal Distributions:**
    * The marginal box plot for `Income` (on the x-axis, common to all facets) reiterates its right-skewness.
    * The marginal box plots for each `AmountSpent` category (on the y-axes) confirm the strong right-skewness previously observed for each of these spending variables during univariate analysis.

**Key Takeaway:**
While higher income enables higher spending across these product categories (fruits, fish, sweets, gold), the actual expenditure is generally low for the majority of customers. Significant spending in these specific categories is less common than in broader categories like wine or meat, and when it occurs, it's often driven by a smaller subset of customers, not strictly dictated by income alone. These categories might represent more niche preferences or occasional purchases for most.
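
The visual impression of "positive but noisy" can be backed by a rank-based correlation, which is less sensitive to the right skew and the extreme `Income` outlier than a Pearson coefficient. The block below is a minimal sketch on synthetic data: the `toy` DataFrame and its values are made up for illustration and only reuse the project's column names. On the real data, the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the project's DataFrame (all values are made up).
rng = np.random.default_rng(0)
n = 500
income = rng.lognormal(mean=10.8, sigma=0.4, size=n)
niche_cols = ["MntFruits", "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

toy = pd.DataFrame({"Income": income})
for col in niche_cols:
    # Spending rises with income on average, but with heavy right-skewed noise.
    toy[col] = np.round(income / 2000 * rng.exponential(scale=1.0, size=n))

# Spearman correlation = Pearson correlation computed on the ranks.
rank_corr = toy.rank().corr()
spearman_with_income = rank_corr["Income"].drop("Income").sort_values(ascending=False)

print(spearman_with_income.round(2))
```

A coefficient that is clearly positive but well below 1 would match the reading above: income raises the ceiling on spending without determining it.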

#### 4. Age vs. Various Spending Categories

To understand how spending patterns might differ across age groups, we will now generate a series of scatter plots. Each plot will show `Age` on the x-axis and the amount spent on one of the product categories (`MntWines`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`) on the y-axis.

This faceted approach will allow us to:
* **Compare Age-Related Trends:** Easily see if the relationship between age and spending varies significantly from one product category to another.
* **Identify Age-Specific Preferences:** Potentially spot if certain age groups tend to spend more on particular types of products.
* **Observe Overall Patterns:** See if there are general tendencies for spending to increase, decrease, or remain stable with age across categories.

We will view these on their original scales and without trendlines to observe the raw data patterns. Marginal box plots will be included for each axis to show the respective distributions.

In [35]:
if 'df' in locals() and df is not None:
    spending_categories_to_plot = [
        'MntWines', 'MntFruits', 'MntMeatProducts', 
        'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'
    ]
    
    required_cols = ['Age'] + spending_categories_to_plot
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        print(f"Error: The following required columns are missing: {missing_cols}")
    else:
        print(f"--- Visualizing Age vs. Spending on: {', '.join(spending_categories_to_plot)} ---")

        if '__temp_id_for_melt__' not in df.columns:
            df_melt = df.reset_index().rename(columns={'index': '__temp_id_for_melt__'})
        else:
            df_melt = df.copy() # Use a copy if the temp ID column already exists

        df_long_age = pd.melt(df_melt, 
                              id_vars=['__temp_id_for_melt__', 'Age'], # 'Age' is our common x-axis
                              value_vars=spending_categories_to_plot,
                              var_name='SpendingCategory', 
                              value_name='AmountSpent')

        navy_blue = '#0B0056'
        orange_color = '#F86302'

        fig = px.scatter(df_long_age,
                         x='Age', # X-axis is now Age
                         y='AmountSpent',
                         facet_row='SpendingCategory', # Creates a row of plots for each category
                         title='Age vs. Spending on Various Product Categories (Original Scale)',
                         labels={'AmountSpent': 'Amount Spent', 'Age': 'Age'},
                         opacity=0.6,
                         marginal_x="box", # Marginal box plot for Age (common for all facets)
                         marginal_y="box", # Marginal box plot for AmountSpent (specific to each facet)
                         height=270 * len(spending_categories_to_plot) # Adjust height
                        )

        # Customize appearance
        fig.update_traces(marker=dict(color=navy_blue, size=5, line=dict(width=0.5, color='DarkSlateGrey')),
                          selector=dict(mode='markers'))
        
        # Update marginal box plot colors
        for trace_data in fig.data:
            if trace_data.type == 'box':
                trace_data.marker.color = orange_color
        
        fig.update_layout(
            title_x=0.5,
            font_family="Arial, sans-serif",
            plot_bgcolor='white',
            paper_bgcolor='white'
        )
        
        fig.update_xaxes(showgrid=False)
        fig.update_yaxes(showgrid=False)


        fig.show()

else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Visualizing Age vs. Spending on: MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds ---


#### 4. Age vs. Various Spending Categories - Scatter Plot Observations

The faceted scatter plots visualize the relationship between customer `Age` and their spending on different product categories (`MntWines`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`).

* **General Observation Across Categories:**
    * For most product categories, there isn't a very strong, simple linear relationship (like a clear increase or decrease) between `Age` and the amount spent.
    * The majority of customers, regardless of age, tend to have lower to moderate spending in most individual categories, as evidenced by the concentration of points towards the bottom of the y-axis in each plot.

* **Specific Category Observations:**
    * **`MntWines` vs. `Age`:**
        * Spending on wine is observed across all age groups.
        * While there isn't a strict linear trend, the *highest* levels of wine expenditure (the upper band of the scatter) appear more frequently among middle-aged customers (e.g., roughly 40s to late 60s). Very young (e.g., <35) and very old (e.g., >75-80) customers show fewer instances of extremely high wine spending.
    * **`MntMeatProducts` vs. `Age`:**
        * Similar to wine, spending on meat products occurs across various ages.
        * The capacity for higher spending on meat also seems more prevalent in the middle-age groups. The scatter of points for higher spending is denser in the 40-70 age range.
    * **`MntFruits`, `MntFishProducts`, `MntSweetProducts` vs. `Age`:**
        * For these categories, spending is generally low across all age groups.
        * There are no prominent age-related trends; the scatter is mostly concentrated near the bottom of the y-axis for all ages, with some outliers.
    * **`MntGoldProds` vs. `Age`:**
        * Spending on gold products is also dispersed across different age groups without a very clear linear increase or decrease. There are instances of higher gold spending across various ages, not strongly concentrated in one particular age segment.

* **Marginal Distributions:**
    * The marginal box plot for `Age` (common x-axis) shows the age distribution we analyzed earlier (concentrated in middle age).
    * The marginal box plots for each `AmountSpent` category (y-axes) reiterate their strong right-skewness, with many customers spending little and a few spending significantly more.

**Key Takeaway:**
Age, by itself, does not appear to be a dominant linear driver for the *amount* spent in most individual product categories. While middle-aged customers show a greater propensity for higher spending in categories like wine and meat, the overall spending in more niche categories (fruits, fish, sweets) is low across all age groups. This suggests that other factors (like income, lifestyle, preferences, family structure) likely play a more significant role in determining spending levels than age alone, or that age might have more complex, non-linear interactions with spending.
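
One way to probe the possible non-linear relationship mentioned above is to bin `Age` and compare spending quantiles per band instead of looking for a straight-line trend. The following is a sketch on synthetic data: the `toy` DataFrame, the hump-shaped spending pattern, and the bin edges are assumptions chosen for illustration, not values taken from the project's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (made-up values): wine spending peaks in middle age.
rng = np.random.default_rng(1)
n = 600
toy = pd.DataFrame({"Age": rng.integers(25, 81, size=n)})
peak = (1.0 - ((toy["Age"] - 55) / 30) ** 2).clip(lower=0.1)
toy["MntWines"] = np.round(rng.exponential(scale=300, size=n) * peak)

# Hypothetical age bands; pd.cut uses right-closed intervals by default.
age_bins = [24, 35, 45, 55, 65, 80]
age_labels = ["25-35", "36-45", "46-55", "56-65", "66-80"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=age_bins, labels=age_labels)

# The median describes the typical customer; the 90th percentile describes
# the "upper band" of the scatter discussed in the observations.
band_summary = toy.groupby("AgeBand", observed=True)["MntWines"].agg(
    median="median",
    p90=lambda s: s.quantile(0.9),
    n="size",
)
print(band_summary)
```

Comparing the `p90` column across bands is a direct check of whether the highest spending levels concentrate in middle age, which a single linear correlation would hide.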

#### 5. Inter-Relationships Between Spending Categories

Having examined how `Income` and `Age` relate to individual spending categories, we now explore how these spending categories relate to *each other*. Understanding these inter-relationships can reveal if customers who spend highly in one category also tend to spend highly in others (complementary purchasing) or if high spending in one category correlates with low spending in another (substitute purchasing or distinct preferences).

We will use a **scatter plot matrix (pair plot)** to visualize the pairwise relationships among all six `Mnt...` (amount spent) columns:
* `MntWines`
* `MntFruits`
* `MntMeatProducts`
* `MntFishProducts`
* `MntSweetProducts`
* `MntGoldProds`

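Each off-diagonal panel in the matrix will show one spending category against another. Because the matrix is built with `go.Splom`, the diagonal panels plot each category against itself (points along the identity line) rather than a histogram or KDE. Only the lower triangle is drawn, since the upper triangle would mirror it.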

In [36]:
if 'df' in locals() and df is not None:
    spending_cols = [
        'MntWines', 'MntFruits', 'MntMeatProducts', 
        'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'
    ]
    
    missing_spending_cols = [col for col in spending_cols if col not in df.columns]

    if missing_spending_cols:
        print(f"Error: The following spending columns are missing: {missing_spending_cols}")
    else:
        print("--- Generating Scatter Plot Matrix for Spending Categories (Lower Triangle with go.Splom) ---")
        
        navy_blue = '#0B0056'
        # One Splom dimension per spending column
        splom_dimensions = [dict(label=col, values=df[col]) for col in spending_cols]

        fig = go.Figure(data=go.Splom(
            dimensions=splom_dimensions,
            text=df.index, # Optional: text to show on hover (e.g., customer index)
            marker=dict(
                color=navy_blue, # Set marker color for scatter points
                size=5,
                line=dict(width=0.5, color='DarkSlateGrey')
            ),
            diagonal=dict(
                visible=True, # Show diagonal plots
            ),
            showupperhalf=False, # THIS IS THE KEY TO HIDE THE UPPER TRIANGLE!
            showlowerhalf=True   # Ensure lower half is shown
        ))
        
        fig.update_layout(
            title="Pairwise Relationships Between Spending Categories (Lower Triangle)",
            title_x=0.5,
            height=1000,
            width=1000,
            font_family="Arial, sans-serif",
            plot_bgcolor='white',
            paper_bgcolor='white',
            showlegend=False,
            dragmode='select', # Allows selecting points
            hovermode='closest'
        )
        fig.show()
        
else:
    print("Error: DataFrame 'df' not found or was not loaded successfully.")
    print("Please ensure your data loading and preprocessing cells were executed correctly.")

--- Generating Scatter Plot Matrix for Spending Categories (Lower Triangle with go.Splom) ---


#### 5. Inter-Relationships Between Spending Categories - Scatter Plot Matrix Summary

The scatter plot matrix visualizes the pairwise relationships between the six main spending categories (`MntWines`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`). Because the figure is built with `go.Splom`, the diagonal panels plot each category against itself rather than drawing a histogram; the concentration of points near the origin in those panels is consistent with the right skew seen in the univariate analysis. The off-diagonal panels are the pairwise scatter plots.

**Key Observations from Pairwise Scatter Plots (Focusing on Lower Triangle):**

* **Strongest Positive Relationships (Complementary Spending):**
    * **`MntWines` vs. `MntMeatProducts`:** There's a clear positive trend. Customers who spend more on wine also tend to spend more on meat products. The points fan out, indicating more variability in meat spending at higher wine spending levels, but the general upward trend is noticeable. This was one of the strongest correlations we saw in the heatmap.
    * **`MntMeatProducts` vs. `MntFruits` & `MntFishProducts`:** Similar positive trends are visible, suggesting customers buying more meat also tend to buy more fruits and fish, though perhaps with a bit more spread than the wine-meat relationship.

* **Moderate Positive Relationships:**
    * **`MntWines` vs. `MntFruits`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`:** Positive trends exist, but they appear somewhat weaker or more dispersed than the wine-meat relationship. Higher wine spending generally aligns with higher spending in these other categories, but with more variance.
    * **`MntMeatProducts` vs. `MntSweetProducts`, `MntGoldProds`:** Positive, but potentially more scattered relationships.
    * **Relationships among `MntFruits`, `MntFishProducts`, `MntSweetProducts`:** These generally show positive but somewhat diffuse relationships. Customers spending more on one of these tend to spend a bit more on the others, but it's not a very tight correlation.

* **General Pattern of Skewness:**
    * In almost all pairwise plots, the majority of data points are clustered in the lower-left corner, representing customers who spend little on both categories being compared.
    * The "fanning out" pattern is common: as spending increases in one category, the spending in the other category also tends to increase, but with a wider range of possibilities (more variance).

* **`MntGoldProds` Relationships:**
    * Spending on gold products shows positive, but generally weaker and more scattered relationships with the food and drink categories compared to how those food/drink categories relate to each other. This suggests gold purchasing might be driven by slightly different factors or occurs less consistently in conjunction with grocery-type purchases.

**Overall Key Takeaway:**
There's a general tendency for customers who are high spenders in one product category to also be higher spenders in others, particularly among the food and wine categories. This suggests some customers have an overall higher "propensity to spend." The relationships are not perfectly linear, and there's considerable variation, but the positive associations are evident. `MntWines` and `MntMeatProducts` appear to be key categories where higher spending often coincides.
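
The pairwise impressions above can be summarized numerically by ranking every category pair by its rank correlation, mirroring the lower triangle of the scatter plot matrix. This is a minimal sketch on synthetic data: the `toy` DataFrame and the shared `propensity` factor are assumptions used to imitate a general "propensity to spend", not the project's actual values.

```python
from itertools import combinations

import numpy as np
import pandas as pd

spending_cols = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]

# Synthetic stand-in: a shared latent propensity plus category-specific noise.
rng = np.random.default_rng(2)
n = 400
propensity = rng.gamma(shape=1.5, scale=1.0, size=n)
toy = pd.DataFrame(
    {col: np.round(propensity * rng.exponential(scale=50, size=n)) for col in spending_cols}
)

# Spearman correlation matrix, computed as Pearson correlation of the ranks.
rank_corr = toy.rank().corr()

# Keep each unordered pair once (the equivalent of the lower triangle).
pairs = pd.Series(
    {(a, b): rank_corr.loc[a, b] for a, b in combinations(spending_cols, 2)}
).sort_values(ascending=False)

print(pairs.round(2))
```

Sorting the pairs makes it easy to confirm which combinations (for example wine and meat) sit at the top, and whether the `MntGoldProds` pairs really fall toward the bottom.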

## **Phase 3: Customer Segmentation using K-Means Clustering**

---

Having completed a thorough Exploratory Data Analysis (EDA), the next phase of this project is to perform customer segmentation using the K-Means clustering algorithm. The primary goal is to identify distinct groups of customers based on their demographic, behavioral, and purchasing characteristics. These segments can then be used to tailor marketing strategies, improve product offerings, and enhance customer engagement.

Our approach will involve the following key steps:

1.  **Feature Selection for Clustering**:
    * Based on the insights from EDA and the objectives of segmentation, we will select the most relevant features from the cleaned dataset. This will likely include a mix of demographic data (`Age`, `Income`, `Education`, `Marital_Status`, `Kidhome`, `Teenhome`), spending patterns (`MntWines`, `MntFruits`, etc.), and purchasing behaviors (`NumDealsPurchases`, `NumWebPurchases`, etc., `Recency`).

2.  **Data Preprocessing for K-Means**:
    * **Encoding Categorical Features**: Convert categorical features like `Education` and `Marital_Status` into a numerical format suitable for K-Means. One-hot encoding will be used to avoid imposing any ordinal relationships. We will also consider grouping very rare categories in `Marital_Status` (e.g., "YOLO", "Absurd", "Alone") into an "Other" category to prevent feature sparsity.
    * **Handling Skewness**: Address the right skewness observed in several numerical features (e.g., `Income`, spending amounts) by applying a log transformation (e.g., `np.log1p`). This can help in making the distributions more symmetrical and improve clustering performance.
    * **Feature Scaling**: Standardize all numerical features using `StandardScaler` to ensure that all features contribute equally to the distance calculations in K-Means, regardless of their original scale.

3.  **Determining the Optimal Number of Clusters (k)**:
    * Employ various techniques to identify the most appropriate number of clusters (`k`) for our dataset. This will include:
        * The **Elbow Method** (observing the Within-Cluster Sum of Squares - WCSS).
        * **Silhouette Analysis** (evaluating the Silhouette Score).
        * **Davies-Bouldin Index**.
        * **Calinski-Harabasz Index**.
    * The choice of `k` will be guided by these metrics, aiming for a balance between cluster distinctiveness and practical interpretability.

4.  **Applying K-Means Algorithm**:
    * Train the K-Means clustering model using the prepared dataset and the determined optimal `k`. We will use `init='k-means++'` for smarter initialization and set `n_init` to run the algorithm multiple times with different centroid seeds to ensure a robust solution. A `random_state` will be set for reproducibility.

5.  **Cluster Analysis and Interpretation (Profiling)**:
    * Once the clusters are formed, we will analyze their characteristics by examining the mean/median values of the input features for each cluster.
    * Visualize the segments to understand their defining attributes, for instance, using bar charts for categorical variables and box plots for numerical variables across clusters.
    * Develop personas or descriptive labels for each customer segment based on these insights, which can then inform targeted business actions.
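
Steps 3 and 4 can be sketched as a single loop that fits K-Means for a range of `k` values and records all four diagnostics. The example below runs on synthetic blobs generated with `make_blobs`, so the data, the range of `k`, and the `n_init=10` and `random_state=42` settings are illustrative assumptions rather than the values used later in this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared feature matrix.
X_raw, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X_raw)

results = []
for k in range(2, 9):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = model.fit_predict(X)
    results.append({
        "k": k,
        "wcss": model.inertia_,                                   # Elbow Method input
        "silhouette": silhouette_score(X, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    })

for row in results:
    print(
        f"k={row['k']}  WCSS={row['wcss']:.1f}  "
        f"silhouette={row['silhouette']:.3f}  "
        f"DB={row['davies_bouldin']:.3f}  CH={row['calinski_harabasz']:.1f}"
    )
```

The metrics do not always agree on a single `k`, which is why the plan above treats them as guidance to be weighed against interpretability rather than as a mechanical rule.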

### Data Preparation for Clustering

Before we can determine the optimal number of clusters or apply the K-Means algorithm, we need to prepare the data. This involves several key transformations:

1.  **Create a Working DataFrame**: We'll make a copy of our cleaned dataset to ensure the original `df` from the EDA phase remains untouched.
2.  **Feature Selection and Categorization**: We will explicitly define lists of categorical, numerical, and skewed numerical features based on our EDA findings. This helps in applying transformations systematically.
3.  **Consolidate Rare Categories in `Marital_Status`**: The `Marital_Status` column contains a few categories with very few observations (e.g., "YOLO", "Absurd", "Alone"). To prevent creating very sparse features after one-hot encoding and to potentially improve cluster stability, we will group these into a single `Other_Marital` category.
4.  **One-Hot Encode Categorical Features**: Convert categorical string features (`Education`, `Marital_Status`) into a numerical format using one-hot encoding. This creates new binary columns for each category, which is suitable for K-Means as it doesn't imply an ordinal relationship.
5.  **Address Skewness in Numerical Features**: Apply a log transformation (`np.log1p`) to the numerical features identified as highly skewed during EDA (e.g., `Income`, various spending amounts). This helps to make their distributions more symmetrical, which benefits distance-based algorithms like K-Means by reducing the influence of extreme values.

Let's implement these steps.

In [37]:
# 1. Create a Working DataFrame
df_cluster = df.copy()
print("Shape of the original DataFrame:", df.shape)
print("Shape of the DataFrame for clustering (df_cluster):", df_cluster.shape)
print("First 5 rows of df_cluster before preprocessing:")
print(df_cluster.head())
print("\n" + "="*50 + "\n")

# 2. Feature Selection and Categorization
categorical_features = ['Education', 'Marital_Status']
# All potential numerical features (excluding those that might be dropped or are target-like if not used for clustering)
# For K-Means, binary features (like campaign acceptance) can also be treated as numerical.
numerical_features = ['Income', 'Kidhome', 'Teenhome', 'Recency', 
                      'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 
                      'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 
                      'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 
                      'NumWebVisitsMonth', 'Age', 'AcceptedCmp1', 'AcceptedCmp2', 
                      'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Complain', 'Response']

# Identify skewed numerical features based on EDA
# (Ensure these are present in the numerical_features list)
skewed_features = ['Income', 'MntWines', 'MntFruits', 'MntMeatProducts', 
                   'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                   'NumCatalogPurchases', 'NumDealsPurchases', 'NumWebPurchases'] # Add/remove based on your EDA findings for skewness

print("Categorical features identified:", categorical_features)
print("Numerical features identified (subset for illustration):", numerical_features[:5]) # Print a few
print("Skewed numerical features identified (subset for illustration):", skewed_features[:5]) # Print a few
print("\n" + "="*50 + "\n")

# 3. Consolidate Rare Categories in `Marital_Status`
print("Original unique values in Marital_Status:", df_cluster['Marital_Status'].unique())
print("Original value counts for Marital_Status:")
print(df_cluster['Marital_Status'].value_counts())

# Consolidate rare categories like "Alone", "Absurd", "YOLO" into "Other"
# Adjust these based on your DataFrame's actual rare categories if different
rare_marital_categories = ['Alone', 'Absurd', 'YOLO']
df_cluster['Marital_Status'] = df_cluster['Marital_Status'].replace(rare_marital_categories, 'Other_Marital')

print("\nUpdated unique values in Marital_Status:", df_cluster['Marital_Status'].unique())
print("Updated value counts for Marital_Status:")
print(df_cluster['Marital_Status'].value_counts())
print("\n" + "="*50 + "\n")


# 4. One-Hot Encode Categorical Features
print(f"Shape of df_cluster before one-hot encoding: {df_cluster.shape}")
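# drop_first=False is safer for interpretation initially; dtype=int keeps the dummies as 0/1 integers
# (recent pandas versions return boolean columns by default).
df_cluster = pd.get_dummies(df_cluster, columns=categorical_features, prefix=categorical_features, drop_first=False, dtype=int)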
print(f"Shape of df_cluster after one-hot encoding: {df_cluster.shape}")
print("First 5 rows of df_cluster after one-hot encoding (showing new columns):")
print(df_cluster.head()) # Display to verify new columns
# Show some of the new column names
new_ohe_columns = [col for col in df_cluster.columns if any(cat_feat in col for cat_feat in categorical_features)]
print("\nSample of new one-hot encoded columns:", new_ohe_columns[:7]) # Print a few
print("\n" + "="*50 + "\n")


# 5. Address Skewness in Numerical Features (Log Transformation)
print("Applying log transformation to skewed features...")
for feature in skewed_features:
    if feature in df_cluster.columns: # Check if feature exists (e.g. if it wasn't dropped)
        # Store original mean and skewness for comparison
        original_mean = df_cluster[feature].mean()
        original_skew = df_cluster[feature].skew()
        
        df_cluster[feature] = np.log1p(df_cluster[feature]) # log1p handles 0 values correctly (log(1+x))
        
        # Store new mean and skewness
        transformed_mean = df_cluster[feature].mean()
        transformed_skew = df_cluster[feature].skew()
        
        print(f"Feature: {feature}")
        print(f"  Original Mean: {original_mean:.2f}, Original Skewness: {original_skew:.2f}")
        print(f"  Transformed Mean: {transformed_mean:.2f}, Transformed Skewness: {transformed_skew:.2f}")
    else:
        print(f"Feature: {feature} not found in df_cluster.columns, skipping transformation.")

print("\nFirst 5 rows of df_cluster after log transformation of skewed features:")
print(df_cluster[skewed_features].head())
print("\n" + "="*50 + "\n")

print("Data preparation (copying, feature definition, OHE, log transform) complete.")
print("Current shape of df_cluster:", df_cluster.shape)

# Next step will be scaling all numerical features.
# Let's update the list of numerical features to include the new one-hot encoded columns
# and exclude the original categorical columns.

# Original numerical features that were not skewed
numerical_features_for_scaling = [feat for feat in numerical_features if feat not in skewed_features]
# Add the (now log-transformed) skewed features
numerical_features_for_scaling.extend(skewed_features) 
# Add the one-hot encoded columns
numerical_features_for_scaling.extend(new_ohe_columns)

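# Ensure no duplicates and all features exist.
# dict.fromkeys de-duplicates while preserving order, so the column order is the same on every run
# (list(set(...)) can reorder string names between sessions).
numerical_features_for_scaling = list(dict.fromkeys(numerical_features_for_scaling))
numerical_features_for_scaling = [feat for feat in numerical_features_for_scaling if feat in df_cluster.columns]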

print(f"\nTotal features to be scaled (numerical + OHE): {len(numerical_features_for_scaling)}")
# print("Features to be scaled:", numerical_features_for_scaling) # Can be a long list

Shape of the original DataFrame: (2213, 25)
Shape of the DataFrame for clustering (df_cluster): (2213, 25)
First 5 rows of df_cluster before preprocessing:
    Education Marital_Status   Income  Kidhome  Teenhome  Recency  MntWines  \
0  Graduation         Single  58138.0        0         0       58       635   
1  Graduation         Single  46344.0        1         1       38        11   
2  Graduation       Together  71613.0        0         0       26       426   
3  Graduation       Together  26646.0        1         0       26        11   
4         PhD        Married  58293.0        1         0       94       173   

   MntFruits  MntMeatProducts  MntFishProducts  ...  NumStorePurchases  \
0         88              546              172  ...                  4   
1          1                6                2  ...                  2   
2         49              127              111  ...                 10   
3          4               20               10  ...                  4   

### Recap of Data Preparation Steps Undertaken

So far in our data preparation for K-Means clustering, we have performed the following crucial transformations on our dataset:

1.  **Created a Dedicated Clustering DataFrame**:
    * A copy of the cleaned dataset from the EDA phase was made (`df_cluster`) to ensure our original data remains intact.

2.  **Organized Features**:
    * We explicitly defined lists to categorize our features:
        * `categorical_features` (e.g., `Education`, `Marital_Status`)
        * `numerical_features` (e.g., `Income`, `Age`, spending amounts, campaign responses)
        * `skewed_features` (a subset of numerical features like `Income` and most spending columns, identified for skewness correction).

3.  **Refined `Marital_Status` Categories**:
    * Very infrequent categories within the `Marital_Status` column (specifically "Alone", "Absurd", "YOLO") were consolidated into a single "Other\_Marital" category. This helps to create more robust features during one-hot encoding and prevents issues with very sparse data.

4.  **Converted Categorical Data to Numerical Format**:
    * One-hot encoding was applied to the `categorical_features` (`Education`, `Marital_Status`). This transformed them into a set of binary (0 or 1) columns, making them suitable for distance-based algorithms like K-Means.

5.  **Addressed Data Skewness**:
    * A log transformation (`np.log1p`) was applied to the `skewed_features`. This reduced the high skewness observed in these features (like `Income` and various product spending amounts), making their distributions more symmetrical. This matters for K-Means because it helps to balance the influence of features and keeps outliers from dominating distance calculations. We checked the effect by printing each feature's mean and skewness before and after the transformation and by viewing the head of the transformed features.

Our `df_cluster` DataFrame now has its categorical variables encoded and skewed numerical variables transformed. The next step is to scale all numerical features.
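
As a side note on steps 3 and 4 of this recap, the rare categories were listed by hand. A frequency threshold is one alternative that avoids hard-coding the labels. The sketch below uses a made-up `toy` column and a hypothetical `min_count` threshold; it is not the method applied above, only an illustration of the same consolidate-then-encode idea.

```python
import pandas as pd

# Made-up example column; counts are chosen only to create two rare labels.
toy = pd.DataFrame({
    "Marital_Status": ["Married"] * 6 + ["Together"] * 4 + ["Single"] * 3 + ["Alone", "YOLO"]
})

min_count = 2  # hypothetical threshold: labels seen fewer times than this are "rare"
counts = toy["Marital_Status"].value_counts()
rare_labels = counts[counts < min_count].index

# Keep frequent labels as-is and map rare ones to a single bucket.
toy["Marital_Status"] = toy["Marital_Status"].where(
    ~toy["Marital_Status"].isin(rare_labels), "Other_Marital"
)

# One-hot encode; dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["Marital_Status"], dtype=int)
print(encoded.sum())
```

Each row of `encoded` has exactly one column set to 1, which is the property that lets K-Means treat the categories without any implied order.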

### Feature Scaling

With our categorical features encoded and skewness addressed in relevant numerical features, the final preprocessing step before applying K-Means is to scale our numerical data.

**Why Scale?**
K-Means clustering calculates distances between data points to form clusters (typically Euclidean distance). If features are on vastly different scales (e.g., `Income` in tens of thousands and `Kidhome` as 0, 1, or 2), features with larger values and variances will dominate the distance calculation, effectively overshadowing the contribution of features with smaller values, even if the latter are equally important for defining clusters.
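
A small numeric illustration of this effect, using three hypothetical customers described only by `Income` and `Kidhome`. The customer values and the per-feature standard deviations are assumptions picked to make the arithmetic easy to follow.

```python
import numpy as np

# Columns: [Income, Kidhome]. All values are hypothetical.
a = np.array([50_000.0, 0.0])
b = np.array([50_500.0, 0.0])  # 1% higher income, same household
c = np.array([50_000.0, 2.0])  # same income, two more children at home

# Raw Euclidean distances: the income difference dwarfs the household difference.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(raw_ab, raw_ac)  # 500.0 2.0

# Divide each feature by an assumed standard deviation (Income: 25,000; Kidhome: 0.5).
# Subtracting the mean is omitted because it cancels out in a difference.
assumed_std = np.array([25_000.0, 0.5])
scaled_ab = np.linalg.norm((a - b) / assumed_std)
scaled_ac = np.linalg.norm((a - c) / assumed_std)
print(scaled_ab, scaled_ac)  # 0.02 4.0
```

On the raw scale, customer `b` looks 250 times farther from `a` than `c` does; after scaling, the ordering flips and the household difference becomes the larger one.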

**Method:**
We will use `StandardScaler` from scikit-learn. This method standardizes features by removing the mean and scaling to unit variance. For each feature, it calculates:
$Z = \frac{(X - \mu)}{\sigma}$
where:
* $X$ is the original feature value.
* $\mu$ is the mean of the feature.
* $\sigma$ is the standard deviation of the feature (`StandardScaler` uses the population standard deviation, i.e., `ddof=0`).

The result is that each feature will have a mean of approximately 0 and a standard deviation of approximately 1. This ensures all features contribute more equally to the clustering process.
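
The formula can be checked against `StandardScaler` directly. The sketch below uses synthetic data (only the row count, 2213, is borrowed from `df_cluster`) and also compares the two common standard deviation conventions: NumPy's default `ddof=0` and pandas' default `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2213  # same row count as df_cluster; the values themselves are synthetic
x = pd.DataFrame({"feature": rng.normal(loc=50.0, scale=12.0, size=n)})

# StandardScaler applies Z = (X - mu) / sigma with the population sigma (ddof=0).
z_sklearn = StandardScaler().fit_transform(x).ravel()
z_manual = ((x["feature"] - x["feature"].mean()) / x["feature"].std(ddof=0)).to_numpy()

std_population = z_sklearn.std()          # NumPy default: ddof=0
std_sample = pd.Series(z_sklearn).std()   # pandas default: ddof=1

print(np.allclose(z_sklearn, z_manual))
print(f"ddof=0: {std_population:.6f}, ddof=1: {std_sample:.6f}")
print(f"sqrt(n / (n - 1)) = {np.sqrt(n / (n - 1)):.6f}")
```

Because pandas divides by $n - 1$ rather than $n$, a standardized column inspected with `DataFrame.std()` reports $\sqrt{n/(n-1)}$ instead of exactly 1.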

We will apply this scaling to all features currently in numerical format in our `df_cluster` DataFrame, including the original numerical features, the log-transformed features, and the one-hot encoded binary features.

In [38]:
if 'numerical_features_for_scaling' not in locals() and 'numerical_features_for_scaling' not in globals():
    print("Reconstructing 'numerical_features_for_scaling' list...")
    # After one-hot encoding, every remaining column in df_cluster is a clustering
    # feature, so all numeric columns are selected for scaling.
    # Note: this assumes the one-hot columns are stored as integers or floats;
    # boolean dummy columns would not be matched by include=np.number.
    numerical_features_for_scaling = df_cluster.select_dtypes(include=np.number).columns.tolist()
    print(f"Reconstructed list of {len(numerical_features_for_scaling)} features for scaling.")


print(f"Number of features to scale: {len(numerical_features_for_scaling)}")
# print("Features to be scaled (first 10):", numerical_features_for_scaling[:10]) # Display a subset

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the selected features and transform them
# It's good practice to scale only the features intended for the model,
# and keep a copy of the DataFrame with original scales if needed for interpretation later.
# Here, we are transforming df_cluster in place for these columns.
df_cluster[numerical_features_for_scaling] = scaler.fit_transform(df_cluster[numerical_features_for_scaling])

print("\nData successfully scaled using StandardScaler.")
print("First 5 rows of df_cluster after scaling (showing a subset of scaled features):")

# Display the head of a few scaled features to verify
# It's useful to pick a mix if possible: an original numerical, a transformed skewed, and an OHE
display_cols = []
if 'Age' in numerical_features_for_scaling: display_cols.append('Age') # Original numerical
if 'Income' in numerical_features_for_scaling: display_cols.append('Income') # Transformed skewed
# Pick one of the OHE columns, e.g., the first one found from Education
education_ohe_cols = [col for col in numerical_features_for_scaling if 'Education_' in col]
if education_ohe_cols: display_cols.append(education_ohe_cols[0])

# If the specific columns aren't found, just show the first few scaled features
if not display_cols:
    display_cols = numerical_features_for_scaling[:3] # Show first 3 if specific ones not found

print(df_cluster[display_cols].head())

# You can also check the mean and standard deviation of a few columns
print("\nMean of a few scaled features (should be close to 0):")
print(df_cluster[display_cols].mean())
print("\nStandard deviation of a few scaled features (should be close to 1):")
print(df_cluster[display_cols].std())

print("\n" + "="*50 + "\n")
print("All preprocessing steps (OHE, log transform, scaling) are now complete.")
print("The df_cluster DataFrame is ready for determining the optimal 'k' and K-Means clustering.")
print("Current shape of df_cluster:", df_cluster.shape)

Number of features to scale: 34

Data successfully scaled using StandardScaler.
First 5 rows of df_cluster after scaling (showing a subset of scaled features):
        Age    Income  Education_Graduation
0  1.018785  0.429029              0.991451
1  1.275248 -0.019158              0.991451
2  0.334882  0.841102              0.991451
3 -1.289387 -1.113193              0.991451
4 -1.032923  0.434293             -1.008623

Mean of a few scaled features (should be close to 0):
Age                    -4.093728e-17
Income                 -3.748570e-15
Education_Graduation    2.247537e-17
dtype: float64

Standard deviation of a few scaled features (should be close to 1):
Age                     1.000226
Income                  1.000226
Education_Graduation    1.000226
dtype: float64


All preprocessing steps (OHE, log transform, scaling) are now complete.
The df_cluster DataFrame is ready for determining the optimal 'k' and K-Means clustering.
Current shape of df_cluster: (2213, 34)


### Observations on Feature Scaling Output:

The feature scaling process using `StandardScaler` has been successfully applied to all 34 numerical features in our `df_cluster` DataFrame. Key observations from the output include:

1.  **Successful Transformation**:
    * The `StandardScaler` was fitted to the data, and the features were transformed as intended. The head of the selected scaled features (`Age`, `Income`, `Education_Graduation`) shows values centered around zero with both positive and negative signs, which is characteristic of standardized data.

2.  **Verification of Standardization**:
    * **Mean close to Zero**: The mean values for the displayed scaled features (`Age`, `Income`, `Education_Graduation`) are extremely close to zero. This confirms that the centering aspect of the standardization was effective.
    * **Standard Deviation close to One**: The standard deviations for these features are very close to one (approximately 1.000226). This indicates that the features have been scaled to unit variance.
    * *Note: The means differ from exactly 0 only because of floating-point arithmetic. The standard deviation of 1.000226 is not rounding noise: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` reports the sample standard deviation (`ddof=1`). For 2,213 rows the ratio between the two is $\sqrt{2213/2212} \approx 1.000226$, which matches the output.*

3.  **Data Readiness**:
    * All features intended for clustering are now on a comparable scale. This prevents features with inherently larger magnitudes or variances from unduly influencing the distance calculations in the K-Means algorithm.
    * The `df_cluster` DataFrame, with its shape remaining `(2213, 34)`, is now fully preprocessed and ready for the next stages: determining the optimal number of clusters (k) and applying the K-Means clustering algorithm.

The data preparation phase is now complete!

### Determining the Optimal Number of Clusters (k)

Before applying K-Means clustering to segment our customers, a crucial step is to determine the optimal number of clusters (`k`) to use. Choosing an appropriate `k` is vital because it directly influences the quality and interpretability of the resulting customer segments. If `k` is too small, we might group dissimilar customers together; if it's too large, we might over-segment, leading to clusters that are too granular and not meaningfully distinct.

We will use a combination of common quantitative methods to guide our choice of `k`. We'll typically look for convergence or points of inflection across these different metrics:

1.  **Elbow Method (Within-Cluster Sum of Squares - WCSS/Inertia)**:
    * This method involves running K-Means clustering for a range of `k` values (here, 2 to 15) and calculating the WCSS (also known as inertia) for each `k`. WCSS is the sum of squared distances between each data point and its assigned cluster centroid.
    * We then plot `k` against WCSS. The plot typically shows WCSS decreasing as `k` increases. The "elbow" is the point on the plot where the rate of decrease sharply slows down, suggesting that adding more clusters beyond this point yields diminishing returns in terms of reducing intra-cluster variance.

2.  **Silhouette Analysis**:
    * The silhouette score measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).
    * The silhouette score for a single data point ranges from -1 to +1:
        * **+1**: The data point is far from neighboring clusters and very close to its own.
        * **0**: The data point is close to the decision boundary between two neighboring clusters.
        * **-1**: The data point is likely assigned to the wrong cluster.
    * We calculate the *average silhouette score* for all data points for different values of `k`. A higher average silhouette score generally indicates better-defined clusters. We look for the `k` that maximizes this score.

3.  **Davies-Bouldin Index**:
    * This index measures the average similarity ratio of each cluster with its most similar cluster. Similarity is the ratio of within-cluster distances to between-cluster distances.
    * **Lower Davies-Bouldin Index values indicate better clustering**, as it means clusters are more compact and well-separated from each other.

4.  **Calinski-Harabasz Index (Variance Ratio Criterion)**:
    * This index is defined as the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion for all clusters.
    * **Higher Calinski-Harabasz Index values generally indicate better-defined clusters** (i.e., dense and well-separated clusters).

We will compute these metrics for a range of `k` values and visualize the results to make an informed decision. It's important to note that these metrics provide guidance, and the final choice of `k` might also consider the interpretability and business utility of the resulting segments.
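
To make two of these definitions concrete, the sketch below recomputes WCSS and the Calinski-Harabasz index by hand and compares them with scikit-learn's values. It runs on synthetic blobs from `make_blobs`, not on `df_cluster`, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in for df_cluster (not the project data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
labels = km.labels_
n = X.shape[0]

# WCSS: squared distance from every point to its assigned centroid, summed.
wcss = float(((X - km.cluster_centers_[labels]) ** 2).sum())

# Calinski-Harabasz: between-cluster dispersion over within-cluster dispersion,
# each divided by its degrees of freedom (k - 1 and n - k).
overall_mean = X.mean(axis=0)
within = 0.0
between = 0.0
for c in range(k):
    members = X[labels == c]
    center = members.mean(axis=0)
    within += ((members - center) ** 2).sum()
    between += len(members) * ((center - overall_mean) ** 2).sum()
ch_manual = (between / (k - 1)) / (within / (n - k))

print(f"WCSS manual={wcss:.4f}  inertia_={km.inertia_:.4f}")
print(f"CH manual={ch_manual:.4f}  sklearn={calinski_harabasz_score(X, labels):.4f}")
```

The degrees-of-freedom terms explain why the Calinski-Harabasz index tends to fall as `k` grows unless the extra clusters reduce within-cluster dispersion substantially.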

In [39]:
if 'df_cluster' not in locals() and 'df_cluster' not in globals():
    print("Error: df_cluster is not defined. Please ensure previous preprocessing steps were run.")
    # df_cluster = pd.DataFrame(np.random.rand(100, 10)) # Placeholder for code to run
else:
    print(f"Using df_cluster with shape: {df_cluster.shape}")

# Define a range of k values to test
k_range = range(2, 16)

# Lists to store the metric values for each k
inertia_values = []
silhouette_scores = []
davies_bouldin_scores = []
calinski_harabasz_scores = []

print(f"\nCalculating K-Means metrics for k in {list(k_range)}...")

for k_val in k_range:
    kmeans = KMeans(n_clusters=k_val,
                    init='k-means++',
                    n_init='auto',  # with init='k-means++', 'auto' means a single initialization per k
                    random_state=42)
    cluster_labels = kmeans.fit_predict(df_cluster)
    
    inertia_values.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(df_cluster, cluster_labels))
    davies_bouldin_scores.append(davies_bouldin_score(df_cluster, cluster_labels))
    calinski_harabasz_scores.append(calinski_harabasz_score(df_cluster, cluster_labels))
    
    print(f"  k={k_val}: Inertia={kmeans.inertia_:.2f}, Silhouette={silhouette_scores[-1]:.4f}, Davies-Bouldin={davies_bouldin_scores[-1]:.4f}, Calinski-Harabasz={calinski_harabasz_scores[-1]:.2f}")

print("\nCalculations complete.")
print("Plotting the results using Plotly...")

# --- Plotting the results using Plotly ---

# Create a DataFrame for easier plotting with Plotly Express
plot_df = pd.DataFrame({
    'k': list(k_range),
    'Inertia': inertia_values,
    'Silhouette Score': silhouette_scores,
    'Davies-Bouldin Score': davies_bouldin_scores,
    'Calinski-Harabasz Score': calinski_harabasz_scores
})

# 1. Elbow Method Plot (Inertia)
fig_inertia = px.line(plot_df, x='k', y='Inertia', title='<b>Elbow Method for Optimal k (Inertia/WCSS)</b>',
                      markers=True, line_shape='linear')
fig_inertia.update_xaxes(title_text='Number of Clusters (k)', showgrid=False, tickvals=list(k_range))
fig_inertia.update_yaxes(title_text='Inertia (WCSS)', showgrid=False)
fig_inertia.update_layout(title_font_size=18, title_x=0.5, # Center title
                          font=dict(size=12))
fig_inertia.show()

# 2. Silhouette Score Plot
fig_silhouette = px.line(plot_df, x='k', y='Silhouette Score', title='<b>Silhouette Score for Optimal k</b>',
                         markers=True, line_shape='linear', color_discrete_sequence=['forestgreen'])
fig_silhouette.update_xaxes(title_text='Number of Clusters (k)', showgrid=False, tickvals=list(k_range))
fig_silhouette.update_yaxes(title_text='Average Silhouette Score', showgrid=False)
fig_silhouette.update_layout(title_font_size=18, title_x=0.5, font=dict(size=12))
fig_silhouette.show()

# 3. Davies-Bouldin Index Plot
fig_db = px.line(plot_df, x='k', y='Davies-Bouldin Score', title='<b>Davies-Bouldin Index for Optimal k</b>',
                 markers=True, line_shape='linear', color_discrete_sequence=['darkorange'])
fig_db.update_xaxes(title_text='Number of Clusters (k)', showgrid=False, tickvals=list(k_range))
fig_db.update_yaxes(title_text='Davies-Bouldin Score (Lower is better)', showgrid=False)
fig_db.update_layout(title_font_size=18, title_x=0.5, font=dict(size=12))
fig_db.show()

# 4. Calinski-Harabasz Index Plot
fig_ch = px.line(plot_df, x='k', y='Calinski-Harabasz Score', title='<b>Calinski-Harabasz Index for Optimal k</b>',
                 markers=True, line_shape='linear', color_discrete_sequence=['purple'])
fig_ch.update_xaxes(title_text='Number of Clusters (k)', showgrid=False, tickvals=list(k_range))
fig_ch.update_yaxes(title_text='Calinski-Harabasz Score (Higher is better)', showgrid=False)
fig_ch.update_layout(title_font_size=18, title_x=0.5, font=dict(size=12))
fig_ch.show()


# --- Print the scores as a table for easy comparison ---
# plot_df already holds every metric, so it can be printed directly.
print("\nSummary of Metrics for Different k Values:")
print(plot_df.to_string(index=False))

Using df_cluster with shape: (2213, 34)

Calculating K-Means metrics for k in [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]...
  k=2: Inertia=62080.02, Silhouette=0.1715, Davies-Bouldin=2.0551, Calinski-Harabasz=468.77
  k=3: Inertia=59579.23, Silhouette=0.1754, Davies-Bouldin=1.7482, Calinski-Harabasz=290.49
  k=4: Inertia=55877.79, Silhouette=0.1302, Davies-Bouldin=2.2453, Calinski-Harabasz=255.17
  k=5: Inertia=53310.89, Silhouette=0.1233, Davies-Bouldin=2.3060, Calinski-Harabasz=227.08
  k=6: Inertia=51253.39, Silhouette=0.1288, Davies-Bouldin=2.1203, Calinski-Harabasz=206.59
  k=7: Inertia=50103.84, Silhouette=0.1039, Davies-Bouldin=2.4281, Calinski-Harabasz=184.47
  k=8: Inertia=50758.71, Silhouette=0.0829, Davies-Bouldin=2.5590, Calinski-Harabasz=151.94
  k=9: Inertia=48639.97, Silhouette=0.0892, Davies-Bouldin=2.4906, Calinski-Harabasz=150.68
  k=10: Inertia=47766.09, Silhouette=0.0910, Davies-Bouldin=2.4500, Calinski-Harabasz=140.80
  k=11: Inertia=45751.90, Silhouette=0.09


Summary of Metrics for Different k Values:
 k      Inertia  Silhouette Score  Davies-Bouldin Score  Calinski-Harabasz Score
 2 62080.017357          0.171522              2.055088               468.768291
 3 59579.226091          0.175442              1.748238               290.493286
 4 55877.794745          0.130183              2.245344               255.173095
 5 53310.894327          0.123279              2.305964               227.082484
 6 51253.391240          0.128831              2.120290               206.592611
 7 50103.838828          0.103859              2.428055               184.466183
 8 50758.709790          0.082938              2.559032               151.939174
 9 48639.970752          0.089234              2.490627               150.675647
10 47766.094127          0.090995              2.449958               140.800526
11 45751.898918          0.094027              2.268061               141.933349
12 44709.169106          0.102094              2.165814          

### Interpretation of Metrics for Optimal k Selection

Using the summary table of clustering evaluation metrics for `k` values ranging from 2 to 15, we can now determine the most suitable number of clusters.

**1. Elbow Method (Inertia/WCSS):**
* **Observation:** The inertia generally decreases as `k` increases. The drop from k=2 to k=3 is approx. 2501, and the drop from k=3 to k=4 is approx. 3701. After k=4, the drops are still present but become less pronounced (e.g., k=4 to k=5 is ~2567; k=5 to k=6 is ~2058; k=6 to k=7 is ~1150). There isn't a single, extremely sharp "elbow." The curve is relatively smooth. Inertia rises at k=8 (50758.71) relative to k=7 (50103.84). The best achievable WCSS can never increase with `k`, so this rise means the k=8 run converged to a poorer local optimum. That is plausible here: fixing `random_state` makes a run reproducible but does not make it optimal, and with `init='k-means++'` the setting `n_init='auto'` performs only a single initialization per `k`. A larger `n_init` (e.g., 10) would likely remove this anomaly.
* **Interpretation:** Given the smoother curve, we look for points where further increases in `k` offer less significant reductions. The reduction is still quite good up to k=6 or k=7, after which it clearly flattens (ignoring the k=8 anomaly). If we had to pick an "elbow" where the *rate* changes, points like k=4 or k=6 could be considered, but it's not definitive.
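
How much a single K-Means run can depend on its starting point is easy to see on synthetic data. The sketch below is an assumption-laden illustration (overlapping `make_blobs` data, ten arbitrary seeds), not a rerun of the analysis: it fits one k-means++ initialization per seed and compares the resulting inertias.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, overlapping blobs as a stand-in for df_cluster (not the project data).
X, _ = make_blobs(n_samples=500, centers=6, n_features=5,
                  cluster_std=3.0, random_state=1)

# One k-means++ initialization per seed: each run may stop in a different local optimum.
single_run_inertias = [
    KMeans(n_clusters=8, init="k-means++", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

best, worst = min(single_run_inertias), max(single_run_inertias)
print(f"best single-run inertia:  {best:.1f}")
print(f"worst single-run inertia: {worst:.1f}")
print(f"relative spread: {(worst - best) / best:.2%}")
```

Setting `n_init=10` asks `KMeans` to run ten initializations and keep the one with the lowest inertia, which makes an inertia-versus-`k` curve less sensitive to one unlucky start.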

**2. Silhouette Score:**
* **Observation:**
    * k=2: 0.1715
    * **k=3: 0.1754 (This is the highest score)**
    * k=4: 0.1302 (A significant drop from k=3)
    * Subsequent scores remain lower than k=3.
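* **Interpretation:** The average Silhouette Score peaks at **`k=3`**, although the margin over k=2 is small (0.1754 vs. 0.1715). This suggests that, on average, data points are best matched to their own cluster and separated from other clusters when `k=3`. All scores are low in absolute terms (below 0.2), which indicates that the clusters overlap considerably at every `k`.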

**3. Davies-Bouldin Index:**
* **Observation:** We look for the minimum value.
    * k=2: 2.0551
    * **k=3: 1.7482 (This is the lowest score in the initial, commonly chosen range of k)**
    * k=4: 2.2453 (The score worsens significantly)
    * Scores at k=14 (1.9966) and k=15 (1.9348) are also relatively low but remain above the k=3 value, and they occur at much higher values of `k` where other metrics have typically degraded and interpretability can be challenging.
* **Interpretation:** The Davies-Bouldin Index strongly favors **`k=3`** among the more practical choices for `k`, indicating better cluster separation (more compact and well-separated clusters) at this point compared to its immediate neighbors.

**4. Calinski-Harabasz Index:**
* **Observation:** We look for the maximum value.
    * **k=2: 468.77 (This is decisively the highest score)**
    * k=3: 290.49 (A significant drop from k=2)
    * Scores continue to decrease for k>3.
* **Interpretation:** This metric strongly supports **`k=2`**.

#### Choosing the Optimal `k`

Summarizing the metrics:

* **`k=2`**: Strongly favored by the Calinski-Harabasz Index. Shows a decent Silhouette Score.
* **`k=3`**: Emerges as a very strong candidate:
    * **Highest Silhouette Score.**
    * **Best Davies-Bouldin Score** in the typical range of `k` (2 through ~7). The scores at much higher `k` (such as k=15, at 1.9348) improve again relative to their neighbors but remain above the k=3 value of 1.7482.
    * The Elbow method shows a continued reduction in inertia, and `k=3` is part of this downward trend.
* `k=4` and above: Show a clear degradation in performance for Silhouette Score and Davies-Bouldin Index compared to `k=3`.
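
The comparison can also be done mechanically. The sketch below re-enters the fully printed rows of the summary table (k=2 through k=11, rounded) and reports which `k` each metric prefers, together with the inertia drop at each step. The `summary` frame and its column names are illustrative.

```python
import pandas as pd

# Values copied from the summary table above (rows k=2..11, rounded).
summary = pd.DataFrame({
    "k": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    "Inertia": [62080.02, 59579.23, 55877.79, 53310.89, 51253.39,
                50103.84, 50758.71, 48639.97, 47766.09, 45751.90],
    "Silhouette": [0.1715, 0.1754, 0.1302, 0.1233, 0.1288,
                   0.1039, 0.0829, 0.0892, 0.0910, 0.0940],
    "Davies-Bouldin": [2.0551, 1.7482, 2.2453, 2.3060, 2.1203,
                       2.4281, 2.5590, 2.4906, 2.4500, 2.2681],
    "Calinski-Harabasz": [468.77, 290.49, 255.17, 227.08, 206.59,
                          184.47, 151.94, 150.68, 140.80, 141.93],
}).set_index("k")

# Preferred k per metric: maximize Silhouette and Calinski-Harabasz, minimize Davies-Bouldin.
best_k = {
    "Silhouette (max)": summary["Silhouette"].idxmax(),
    "Davies-Bouldin (min)": summary["Davies-Bouldin"].idxmin(),
    "Calinski-Harabasz (max)": summary["Calinski-Harabasz"].idxmax(),
}

# Inertia drop when moving from k-1 to k; a negative value means inertia went up.
inertia_drop = (-summary["Inertia"].diff()).dropna().round(2)

print(best_k)
print(inertia_drop)
```

Within these rows, two of the three metrics select k=3 and one selects k=2, and the only negative inertia drop appears at k=8.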

**Recommendation for `k`:**

Based on these metrics, **`k=3`** stands out as the most compelling choice for the number of clusters.

**Justification for `k=3`:**
1.  **Peak Silhouette Score:** `k=3` achieves the highest average Silhouette Score (0.1754). This is a strong indicator that with three clusters, the data points are, on average, more appropriately grouped and better separated from points in other clusters than for any other `k` value tested.
2.  **Optimal Davies-Bouldin Score (Practical Range):** `k=3` yields the lowest Davies-Bouldin score (1.7482) within the generally accepted range of practical `k` values. This suggests that the clusters at `k=3` are more compact and better separated from each other compared to other `k` values in this range.
3.  **Consistency Across Key Metrics:** While `k=2` is favored by the Calinski-Harabasz index, `k=3` is optimal for both Silhouette and Davies-Bouldin (in the practical range), which are often considered very important for assessing cluster quality.
4.  **Deterioration Beyond `k=3`:** There's a noticeable drop in the Silhouette Score and a worsening of the Davies-Bouldin score immediately after `k=3` (i.e., at `k=4`), suggesting that increasing the number of clusters beyond three leads to a poorer clustering structure by these measures.
5.  **Balance and Interpretability:** Three segments can provide a good level of detail for targeted strategies—more nuanced than two, yet typically still manageable and interpretable from a business perspective.

**Conclusion:**
The evidence from these metrics, particularly the peak in the Silhouette Score and the minimum in the Davies-Bouldin Score (within the practical range), points towards **`k=3`** as the optimal number of clusters for segmenting the customer data.

### Applying K-Means Clustering and Profiling Segments (k=3)

Based on our analysis of various clustering evaluation metrics, we've chosen `k=3` as the optimal number of clusters for segmenting our customer dataset. The next steps involve:

1.  **Fitting the K-Means Model**:
    * We will train the K-Means algorithm on our fully preprocessed `df_cluster` DataFrame, explicitly setting `n_clusters=3`.
    * A consistent `random_state` will be used to ensure that if we re-run the clustering, we get the same assignments, which is crucial for reproducibility.

2.  **Assigning Cluster Labels**:
    * Once the model is trained, each customer (data point) in `df_cluster` will be assigned to one of the three clusters.
    * We will then add these cluster labels as a new column to our `df_cluster` DataFrame.
    * Crucially, for easier interpretation of cluster characteristics in their original scales, we will also add these cluster labels to a copy of the DataFrame as it existed *before* log transformation and scaling but *after* cleaning, feature engineering, and outlier removal, with `Education` and `Marital_Status` still in categorical form. We call this copy `df_profiling`.

3.  **Cluster Profiling and Analysis**:
    * This is where we delve into the "personality" of each segment. Our goal is to identify the distinguishing characteristics of customers in each of the three clusters. We'll do this by:
        * **Calculating Descriptive Statistics**: For each cluster, we will compute:
            * Mean values for numerical features (e.g., average income, average age, average spending on different product categories).
            * Value counts (or proportions) for categorical/binary features (e.g., distribution of education levels, marital statuses, campaign acceptance rates within each cluster).
            * We'll perform this on `df_profiling` to interpret values like income and spending in their original, understandable units.
        * **Visualizing Cluster Characteristics**: We'll create various plots to compare the segments:
            * **Bar Charts**: To compare the distribution of categorical features (like original `Education`, `Marital_Status` before OHE, or the OHE features themselves) across the three clusters.
            * **Box Plots or Violin Plots**: To compare the distributions of key numerical features (e.g., `Income`, `Age`, `MntWines`, `NumStorePurchases`) across the clusters. This helps see differences in medians, spreads, and identify potential outliers within segments.
            * **Radar Charts (Optional but good for summaries)**: These can provide a holistic view by plotting the average values of several key (normalized) features for each cluster on a single chart.
        * **Developing Segment Personas**: Based on the dominant characteristics, we will try to give each segment a descriptive name or persona (e.g., "High-Value Regulars," "Budget-Conscious Occasionals," "Emerging Young Spenders"). This makes the segments more tangible for business stakeholders.

4.  **Visualizing Cluster Separation**:
    * To get a visual sense of how distinct the clusters are in the feature space, we can apply Principal Component Analysis (PCA) to our scaled `df_cluster` data (the one used for clustering) to reduce its dimensionality to 2 or 3 components.
    * We can then create a scatter plot of these principal components, coloring the data points according to their assigned cluster label. This doesn't show the full picture (since we're reducing many dimensions), but it can give an intuitive feel for the cluster separation.
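
A minimal sketch of the PCA step, assuming a scaled feature matrix and a fitted three-cluster model. It uses synthetic blobs in place of `df_cluster`, and the names `pca_df`, `PC1`, and `PC2` are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled df_cluster (not the project data).
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X_raw)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Project the scaled features onto the first two principal components.
pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(components, columns=["PC1", "PC2"])
pca_df["Cluster"] = labels.astype(str)  # string labels are treated as discrete categories when plotting

# Share of total variance retained by the 2-D view.
explained = pca.explained_variance_ratio_.sum()
print(f"Variance explained by PC1 + PC2: {explained:.1%}")
print(pca_df.head())
```

The resulting frame can be passed to a scatter plot, for example `px.scatter(pca_df, x="PC1", y="PC2", color="Cluster")`. The explained-variance share is worth reporting next to the plot, since a low value means the 2-D picture hides much of the structure.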

Let's start by fitting the K-Means model with `k=3` and adding the cluster labels to our data.

In [40]:
# --- Step 1: Fit the K-Means Model with k=3 ---
optimal_k = 3

kmeans_final = KMeans(n_clusters=optimal_k,
                      init='k-means++',
                      n_init='auto',
                      random_state=42) # Use the same random_state for consistency

# Fit the model to the fully preprocessed data
kmeans_final.fit(df_cluster) # df_cluster is your scaled and transformed data for clustering

print(f"K-Means model fitted with k={optimal_k} clusters.")

# --- Step 2: Assign Cluster Labels ---
# Get the cluster labels for each data point
cluster_labels_final = kmeans_final.labels_

# Add the cluster labels to the scaled DataFrame (df_cluster)
df_cluster_labeled = df_cluster.copy()
df_cluster_labeled['Cluster'] = cluster_labels_final
print(f"\nCluster labels added to 'df_cluster_labeled'. Shape: {df_cluster_labeled.shape}")
print("Value counts for each cluster in 'df_cluster_labeled':")
print(df_cluster_labeled['Cluster'].value_counts().sort_index())


# --- Prepare a DataFrame for Profiling (with original scales where possible) ---
# We need a DataFrame that has:
# - Original numerical features (before log transform and scaling)
# - Original categorical features (or their one-hot encoded versions if originals were dropped)
# - The new 'Cluster' labels

# 'df' is the DataFrame after cleaning, feature engineering, and outlier removal,
# but BEFORE log transformation and scaling. It still holds 'Education' and
# 'Marital_Status' in categorical form (rare marital statuses may already be consolidated)
# and must contain the same rows, in the same order, as df_cluster.

df_profiling = df.copy()

# Add the cluster labels. cluster_labels_final is a NumPy array, so the assignment below
# is positional: it is only correct if df_profiling has the same row order as df_cluster.
# The length check guards against a row-count mismatch but not against reordering.
if len(df_profiling) == len(cluster_labels_final):
    df_profiling['Cluster'] = cluster_labels_final
    print(f"\nCluster labels added to 'df_profiling'. Shape: {df_profiling.shape}")
    print("First 5 rows of 'df_profiling' with Cluster labels:")
    print(df_profiling.head())
else:
    print(f"\nError: Length mismatch between df_profiling ({len(df_profiling)}) and cluster_labels ({len(cluster_labels_final)}).")
    print("Ensure df_profiling corresponds to the data used for clustering before adding labels.")
    # You might need to re-align df_profiling with the index of df_cluster if it was changed
    # For example, if df_cluster had its index reset:
    # df_profiling_aligned = df.loc[df_cluster.index]
    # df_profiling_aligned['Cluster'] = cluster_labels_final
    # print(df_profiling_aligned.head())


# --- Initial look at Cluster Centroids (on scaled data) ---
# Centroids are in the scaled feature space
centroids_scaled = kmeans_final.cluster_centers_
df_centroids_scaled = pd.DataFrame(centroids_scaled, columns=df_cluster.columns) # Excludes the 'Cluster' column we just added to df_cluster_labeled
print("\nCluster Centroids (in scaled feature space):")
print(df_centroids_scaled)

print("\n" + "="*50 + "\n")
print("Next steps will involve detailed profiling of these clusters using 'df_profiling'.")

K-Means model fitted with k=3 clusters.

Cluster labels added to 'df_cluster_labeled'. Shape: (2213, 35)
Value counts for each cluster in 'df_cluster_labeled':
Cluster
0    1110
1    1049
2      54
Name: count, dtype: int64

Cluster labels added to 'df_profiling'. Shape: (2213, 26)
First 5 rows of 'df_profiling' with Cluster labels:
    Education Marital_Status   Income  Kidhome  Teenhome  Recency  MntWines  \
0  Graduation         Single  58138.0        0         0       58       635   
1  Graduation         Single  46344.0        1         1       38        11   
2  Graduation       Together  71613.0        0         0       26       426   
3  Graduation       Together  26646.0        1         0       26        11   
4         PhD        Married  58293.0        1         0       94       173   

   MntFruits  MntMeatProducts  MntFishProducts  ...  NumWebVisitsMonth  \
0         88              546              172  ...                  7   
1          1                6             

#### Cluster Profiling and Analysis:

To begin understanding the distinct characteristics of our three customer segments, we will calculate and compare the average (mean) values of key numerical features for each cluster. This will help us identify differences in demographics, purchasing behavior, and engagement levels. We will use the `df_profiling` DataFrame, which contains our features in their original, interpretable units.

Key numerical features to examine include:
* **Demographics & Household**: `Age`, `Income`, `Kidhome`, `Teenhome`.
* **Spending Behavior**: `MntWines`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts`, `MntGoldProds`.
* **Purchase Behavior**: `NumDealsPurchases`, `NumWebPurchases`, `NumCatalogPurchases`, `NumStorePurchases`, `Recency`.
* **Engagement**: `NumWebVisitsMonth`.
* **Campaign Acceptance & Complaints (as proportions)**: `AcceptedCmp1`, `AcceptedCmp2`, `AcceptedCmp3`, `AcceptedCmp4`, `AcceptedCmp5`, `Complain`, `Response`.

By comparing these means, we can start to build a picture of each segment.
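
For the binary columns, the group mean is simply the share of customers with a value of 1. A tiny sketch with invented numbers (the `toy_profile` frame is hypothetical) shows both kinds of column side by side:

```python
import pandas as pd

# Hypothetical mini version of df_profiling; values are invented.
toy_profile = pd.DataFrame({
    "Cluster":  [0, 0, 0, 0, 1, 1, 1, 1],
    "Income":   [70_000, 65_000, 72_000, 61_000, 35_000, 40_000, 38_000, 31_000],
    "Response": [1, 0, 1, 1, 0, 0, 1, 0],
})

# For a 0/1 column the group mean equals the share of 1s in that group.
toy_means = toy_profile.groupby("Cluster")[["Income", "Response"]].mean()
print(toy_means)
```

Here the `Response` means of 0.75 and 0.25 read directly as 75% and 25% response rates for the two toy clusters.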

In [41]:
if 'df_profiling' not in locals() and 'df_profiling' not in globals():
    print("Error: df_profiling is not defined. Please ensure previous steps were run correctly.")
    # This is a fallback, you should have df_profiling from the previous step.
    # For the code to run if it's missing for some reason:
    # data_for_fallback = {
    # 'Income': np.random.randint(20000, 100000, 100), 'Age': np.random.randint(25, 70, 100),
    # 'MntWines': np.random.randint(0, 1000, 100), 'NumStorePurchases': np.random.randint(0,10,100),
    # 'Cluster': np.random.choice([0,1,2], 100)
    # }
    # df_profiling = pd.DataFrame(data_for_fallback)
else:
    print(f"Using df_profiling with shape: {df_profiling.shape} for calculating mean characteristics.")

# Define the list of numerical features we want to profile by their mean
# These should be present in your df_profiling DataFrame
numerical_cols_for_profiling = [
    'Income', 'Age', 'Kidhome', 'Teenhome', 'Recency',
    'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
    'MntSweetProducts', 'MntGoldProds',
    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
    'NumWebVisitsMonth',
    # Campaign acceptance, complain, and overall response are binary (0/1),
    # so their mean will represent the proportion of '1's (e.g., acceptance rate)
    'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5',
    'Complain', 'Response'
]

# Ensure all selected columns exist in df_profiling to avoid errors
existing_numerical_cols = [col for col in numerical_cols_for_profiling if col in df_profiling.columns]
if len(existing_numerical_cols) != len(numerical_cols_for_profiling):
    print("\nWarning: Not all specified numerical_cols_for_profiling were found in df_profiling.")
    print(f"Found: {existing_numerical_cols}")
    print(f"Missing: {list(set(numerical_cols_for_profiling) - set(existing_numerical_cols))}")
    numerical_cols_for_profiling = existing_numerical_cols # Use only existing columns

if not numerical_cols_for_profiling:
    print("Error: No valid numerical columns found for profiling. Please check your column list and DataFrame.")
else:
    # Group by 'Cluster' and calculate the mean for the selected numerical columns
    cluster_means_numerical = df_profiling.groupby('Cluster')[numerical_cols_for_profiling].mean()

    print("\nMean values of numerical features for each cluster:")
    # Clusters are rows and features are columns. With many features the printed
    # table is wide, so pandas may elide the middle columns with '...'.
    print(cluster_means_numerical)
    # To get features as rows and clusters as columns instead, transpose:
    # print(cluster_means_numerical.T)

    # Optional: Rounding for cleaner presentation
    # print("\nMean values of numerical features for each cluster (rounded):")
    # print(cluster_means_numerical.round(2)) # Round to 2 decimal places

    print("\n" + "="*50 + "\n")
    print("Next, we will analyze categorical features and then visualize these profiles.")


Using df_profiling with shape: (2213, 26) for calculating mean characteristics.

Mean values of numerical features for each cluster:
               Income        Age   Kidhome  Teenhome    Recency    MntWines  \
Cluster                                                                       
0        67599.400901  58.107207  0.114414  0.506306  49.266667  540.843243   
1        37624.101049  54.380362  0.778837  0.526215  48.762631   71.094376   
2        20306.259259  47.537037  0.629630  0.092593  48.444444    7.240741   

         MntFruits  MntMeatProducts  MntFishProducts  MntSweetProducts  ...  \
Cluster                                                                 ...   
0        48.040541       303.525225        68.800901         49.209910  ...   
1         4.127741        30.464252         5.716873          4.338418  ...   
2        11.111111        11.444444        17.055556         12.111111  ...   

         NumCatalogPurchases  NumStorePurchases  NumWebVisitsMonth  \
Clust

#### Interpretation of Cluster Means (Numerical Features)

The table of mean values for numerical features provides significant insights into the characteristics of our three customer segments. We are using `df_profiling` to ensure these means are in their original, interpretable units.
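
Because the printed table above is wide, pandas elides its middle columns with `...`. The sketch below shows one way to view every feature: transpose so the three clusters become columns, and lift the column limit while printing. The `cluster_means` frame is a small stand-in built from a few of the values shown above; in the notebook it would be `cluster_means_numerical`.

```python
import pandas as pd

# Stand-in for cluster_means_numerical, using a few values copied from the output above
cluster_means = pd.DataFrame(
    {
        "Income": [67599.400901, 37624.101049, 20306.259259],
        "Age": [58.107207, 54.380362, 47.537037],
        "Kidhome": [0.114414, 0.778837, 0.629630],
        "Teenhome": [0.506306, 0.526215, 0.092593],
        "Recency": [49.266667, 48.762631, 48.444444],
        "MntWines": [540.843243, 71.094376, 7.240741],
    },
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# Features as rows, clusters as columns: three narrow columns are never elided
means_by_feature = cluster_means.T.round(2)

# Temporarily remove the column limit so nothing is replaced by '...'
with pd.option_context("display.max_columns", None, "display.width", 200):
    print(means_by_feature)
```

Using `pd.option_context` keeps the display change local to this one print, so other cells are unaffected.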

**Cluster Sizes (from K-Means `value_counts()` output):**
* **Cluster 0:** 1110 customers
* **Cluster 1:** 1049 customers
* **Cluster 2:** 54 customers (a notably small segment)
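
To put these sizes in proportion, a minimal sketch of the size-and-share calculation is shown here. The `labels` series is rebuilt from the three reported counts purely for illustration; in the notebook it would be `df_profiling['Cluster']`.

```python
import pandas as pd

# Illustrative label vector rebuilt from the reported counts;
# in the notebook this would be df_profiling["Cluster"].
labels = pd.Series([0] * 1110 + [1] * 1049 + [2] * 54, name="Cluster")

# Customers per cluster, ordered by cluster label
sizes = labels.value_counts().sort_index()

# Share of the whole customer base, in percent
shares = (sizes / sizes.sum() * 100).round(1)

summary = pd.DataFrame({"customers": sizes, "share_pct": shares})
print(summary)
```

On these counts, Cluster 2 accounts for about 2.4% of the 2,213 customers, which is worth keeping in mind whenever its averages and rates are compared with those of the two large clusters.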

Here's a comparative analysis based on the means from the output above:

**A. Demographics & Household Structure:**

* **Income:**
    * **Cluster 0:** Highest average income (approx. \$67,599).
    * **Cluster 1:** Low-Middle average income (approx. \$37,624).
    * **Cluster 2:** Lowest average income (approx. \$20,306).
    * *Insight:* Income clearly delineates the clusters, with Cluster 0 being the most affluent.

* **Age:**
    * **Cluster 0:** Oldest on average (approx. 58.1 years).
    * **Cluster 1:** Middle-aged (approx. 54.4 years).
    * **Cluster 2:** Youngest on average (approx. 47.5 years).
    * *Insight:* Age is another differentiating factor.

* **Kidhome (Number of Young Children):**
    * **Cluster 1:** Highest average number of young children (approx. 0.78).
    * **Cluster 2:** Also a high average (approx. 0.63).
    * **Cluster 0:** Very few young children (approx. 0.11).
    * *Insight:* Clusters 1 and 2 are more likely to have young children at home.

* **Teenhome (Number of Teenagers):**
    * **Cluster 1:** Highest average number of teenagers (approx. 0.53).
    * **Cluster 0:** Similar average number of teenagers (approx. 0.51).
    * **Cluster 2:** Very few teenagers (approx. 0.09).
    * *Insight:* Cluster 2 households have notably fewer teenagers.

**B. Spending Behavior (Average Amount Spent - "Mnt" features):**

* **Overall Spending Pattern:**
    * **Cluster 0:** Exhibits substantially higher spending across all primary product categories (`MntWines`: ~\$541, `MntFruits`: ~\$48, `MntMeatProducts`: ~\$304, `MntFishProducts`: ~\$69, `MntSweetProducts`: ~\$49, `MntGoldProds`: ~\$76).
    * **Cluster 1:** Low spenders in comparison (e.g., `MntWines`: ~\$71, `MntMeatProducts`: ~\$30).
    * **Cluster 2:** The lowest spenders across most categories (e.g., `MntWines`: ~\$7, `MntMeatProducts`: ~\$11).
    * *Insight:* Cluster 0 are the clear high-value customers in terms of total expenditure.

**C. Purchase Behavior (Average Number of Purchases - "Num" features & Recency):**

* **NumDealsPurchases:**
    * **Cluster 2:** Makes the most purchases with deals (approx. 3.7).
    * **Cluster 1:** Moderate deal purchases (approx. 2.2).
    * **Cluster 0:** Fewest deal purchases (approx. 1.9).
    * *Insight:* Cluster 2 shows a strong preference for deals, while Cluster 0 (high spenders) uses them least.

* **Purchase Channels:**
    * **NumWebPurchases:** Cluster 0 (5.8) > Cluster 1 (2.5) > Cluster 2 (2.3).
    * **NumCatalogPurchases:** Cluster 0 (4.6) >> Cluster 1 (0.7) & Cluster 2 (0.5).
    * **NumStorePurchases:** Cluster 0 (8.2) >> Cluster 1 (3.4) & Cluster 2 (2.9).
    * *Insight:* Cluster 0 not only spends more but also makes a higher volume of purchases across all channels, particularly excelling in catalog and store purchases.

* **Recency (Days since last purchase):**
    * Cluster 0: ~49.3 days
    * Cluster 1: ~48.8 days
    * Cluster 2: ~48.4 days
    * *Insight:* The average recency is very similar across the clusters, indicating this feature doesn't strongly differentiate the segments by its mean.
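
One way to quantify how strongly a feature separates the clusters, rather than judging it by eye, is the share of its variance explained by cluster membership (often called eta squared). The sketch below is an illustration on synthetic data, not part of the original analysis; the function name `separation_score` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def separation_score(df, features, cluster_col="Cluster"):
    """Share of each feature's variance explained by cluster membership (eta squared)."""
    scores = {}
    for col in features:
        overall_mean = df[col].mean()
        total_ss = ((df[col] - overall_mean) ** 2).sum()
        grouped = df.groupby(cluster_col)[col]
        # Between-cluster sum of squares: cluster size times squared mean offset
        between_ss = (grouped.size() * (grouped.mean() - overall_mean) ** 2).sum()
        scores[col] = between_ss / total_ss if total_ss > 0 else np.nan
    return pd.Series(scores).sort_values(ascending=False)

# Synthetic toy data: Income differs by cluster, Recency does not
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Cluster": np.repeat([0, 1, 2], 200),
    "Income": np.concatenate([rng.normal(m, 8000, 200) for m in (67000, 37000, 20000)]),
    "Recency": rng.integers(0, 100, 600),
})

scores = separation_score(toy, ["Income", "Recency"])
print(scores.round(3))
```

A score near 1 means cluster membership explains most of the feature's variation, while a score near 0 (as expected for a feature like `Recency`, whose cluster means barely differ) means the clusters overlap almost completely on that feature.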

**D. Engagement & Campaign Response:**

* **NumWebVisitsMonth:**
    * **Cluster 2:** Highest average web visits (approx. 6.9).
    * **Cluster 1:** Also high web visits (approx. 6.5).
    * **Cluster 0:** Lowest average web visits (approx. 4.1).
    * *Insight:* The lower-spending clusters (1 and 2) visit the website more frequently. This could suggest more browsing or price comparison behavior.

* **Campaign Acceptance (`AcceptedCmp1-5`, `Response` - latest campaign):**
    * **Cluster 0:** Shows significantly higher acceptance rates for most campaigns (Cmp1: ~12%, Cmp2: ~3%, Cmp4: ~12%, Cmp5: ~15%) and the overall latest campaign (`Response`: ~20%).
    * **Cluster 1:** Low acceptance for earlier campaigns; moderate for the latest (`Response`: ~10%).
    * **Cluster 2:** Very low to no acceptance for most campaigns (Cmp1, Cmp2, Cmp4, Cmp5 are 0%). Lowest `Response` rate (~4%).
    * **Cmp3 (an outlier campaign):** Has low acceptance across all, with Cluster 2 showing a slightly higher rate (~11%) than C0 (~8%) and C1 (~7%).
    * *Insight:* Cluster 0 is the most responsive to marketing initiatives overall. Cluster 2 is largely unresponsive.

* **Complain:**
    * Complaints are minimal across all segments.
    * **Cluster 1:** Shows a slightly higher tendency (approx. 1.1%).
    * **Cluster 0:** Low (approx. 0.7%).
    * **Cluster 2:** No complaints recorded (0%).
    * *Insight:* Cluster 2 stands out for having no complaints, though rates are low everywhere.

#### Preliminary Segment Profiles (Based on Current Output):

* **Cluster 0: "Affluent & Engaged High-Value Customers" (Largest Segment: 1110 customers)**
    * **Characteristics:** Highest income, oldest group, fewest young children, moderate teens. Dominant spenders across all product categories. High purchase volume via web, catalog, and stores. Uses deals least. Visits website less frequently than other segments (perhaps more targeted purchasing). Most responsive to marketing campaigns.
    * **Key Differentiators:** High income, high total spending, high multi-channel purchasing volume, high campaign responsiveness.

* **Cluster 1: "Middle-Income Families - Moderate Web Browsers" (1049 customers)**
    * **Characteristics:** Low-Middle income, middle-aged, many young children and teens. Low overall spending. Low purchase volume across channels. Moderate deal usage. High number of monthly web visits. Moderately responsive to the latest marketing campaign, but less so to earlier ones.
    * **Key Differentiators:** Presence of children/teens, low-mid income, low spending levels, high web visit frequency, moderate engagement with recent campaigns.

* **Cluster 2: "Youngest, Low-Income Deal-Seekers - Low Engagement & Niche Size" (Smallest Segment: 54 customers)**
    * **Characteristics:** Youngest, lowest income, many young children but very few teens. The lowest overall spenders. Lowest purchase volume but highest use of deals. Highest number of monthly web visits (suggesting browsing). Least responsive to marketing campaigns. Notably, this is a very small segment.
    * **Key Differentiators:** Youngest, lowest income, high `Kidhome`, very low spending, highest deal usage, very low campaign response. The small size of this cluster warrants attention—it could represent a distinct niche or potentially a group influenced by outliers if not carefully handled during preprocessing.
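
Because Cluster 2 contains only 54 customers, any rate computed for it carries wide uncertainty. As a hedged illustration (this check is not part of the original analysis), the sketch below computes an approximate 95% Wilson score interval for a proportion and applies it to the zero complaints observed among the 54 customers of Cluster 2. The helper name `wilson_interval` is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb tiny floating-point overshoot
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Cluster 2: no complaints recorded among 54 customers
low, high = wilson_interval(0, 54)
print(f"0 of 54 complaints: plausible rate between {low:.1%} and {high:.1%}")
```

With zero events in 54 customers, the upper bound lands at roughly 6 to 7%, so the 0% complaint rate for Cluster 2 is compatible with a true rate above those of Clusters 0 and 1. Differences involving this cluster are best read as tentative.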

#### Profiling Categorical Features: Distribution within Clusters

Now that we have a sense of the numerical differences between the clusters, let's examine the categorical features to see how their distributions vary across the segments. This will help us understand attributes like the predominant education levels and marital statuses within each cluster.

We will analyze:
* **`Education`**: The distribution of education levels (e.g., Graduation, PhD, Master, etc.) in each cluster.
* **`Marital_Status`**: The distribution of marital statuses (e.g., Married, Together, Single, Divorced, Widow, Other_Marital) in each cluster.

We will calculate the proportion of each category for these features within each of the three clusters using the `df_profiling` DataFrame.

In [42]:
if 'df_profiling' not in locals() and 'df_profiling' not in globals():
    print("Error: df_profiling is not defined. Please ensure previous steps were run correctly.")
    # Fallback for df_profiling if it's somehow not defined
    # This is illustrative; df_profiling should exist from prior steps.
    # data_for_fallback_cat = {
    # 'Education': np.random.choice(['Graduation', 'PhD', 'Master', '2n Cycle', 'Basic'], 100),
    # 'Marital_Status': np.random.choice(['Married', 'Single', 'Together', 'Divorced', 'Widow', 'Other_Marital'], 100),
    # 'Cluster': np.random.choice([0,1,2], 100)
    # }
    # df_profiling = pd.DataFrame(data_for_fallback_cat)
else:
    print(f"Using df_profiling with shape: {df_profiling.shape} for analyzing categorical features.")

# List of categorical columns to analyze
# These should be the original categorical columns in df_profiling
# (after any pre-OHE consolidation like grouping rare Marital_Status)
categorical_cols_to_profile = ['Education', 'Marital_Status']

# Ensure these columns exist
existing_categorical_cols = [col for col in categorical_cols_to_profile if col in df_profiling.columns]
if len(existing_categorical_cols) != len(categorical_cols_to_profile):
    print(f"\nWarning: Not all specified categorical_cols_to_profile were found in df_profiling.")
    print(f"Found: {existing_categorical_cols}")
    print(f"Missing: {list(set(categorical_cols_to_profile) - set(existing_categorical_cols))}")
    categorical_cols_to_profile = existing_categorical_cols

if not categorical_cols_to_profile:
    print("Error: No valid categorical columns found for profiling. Please check your column list and DataFrame.")
else:
    for col in categorical_cols_to_profile:
        print(f"\n--- Distribution of '{col}' within each Cluster ---")
        # Group by cluster and then get value counts for the categorical column, normalized to get proportions
        # Using as_index=False with groupby().value_counts() can make it easier to unstack or pivot later if needed for combined display
        # but for separate prints, a loop is clear.
        
        # Calculate proportions
        proportions = df_profiling.groupby('Cluster')[col].value_counts(normalize=True).mul(100).round(2)
        
        # To display it nicely, perhaps unstack if it makes sense, or just print the multi-index series
        print(proportions.unstack(fill_value=0)) # Unstack to get categories as columns, fill_value=0 for categories not present in a cluster
        
        # Alternative: Loop through clusters and print value_counts for each
        # print("\nAlternative display (value counts per cluster):")
        # for cluster_num in sorted(df_profiling['Cluster'].unique()):
        #     print(f"\nCluster {cluster_num}:")
        #     print(df_profiling[df_profiling['Cluster'] == cluster_num][col].value_counts(normalize=True).mul(100).round(2))

    print("\n" + "="*50 + "\n")
    print("Categorical feature analysis complete.")
    print("Next, we will proceed to visualize both numerical and categorical profiles.")

Using df_profiling with shape: (2213, 26) for analyzing categorical features.

--- Distribution of 'Education' within each Cluster ---
Education  2n Cycle  Basic  Graduation  Master    PhD
Cluster                                              
0               8.2    0.0       53.78   15.32  22.70
1              10.2    0.0       49.48   18.59  21.73
2               0.0  100.0        0.00    0.00   0.00

--- Distribution of 'Marital_Status' within each Cluster ---
Marital_Status  Absurd  Alone  Divorced  Married  Single  Together  Widow  \
Cluster                                                                     
0                 0.18   0.00     10.81    37.93   20.90     25.68   4.50   
1                 0.00   0.29     10.49    39.66   20.97     26.02   2.38   
2                 0.00   0.00      1.85    37.04   33.33     25.93   1.85   

Marital_Status  YOLO  
Cluster               
0               0.00  
1               0.19  
2               0.00  


Categorical feature analysis c

#### Interpretation of Categorical Feature Distributions

Following the analysis of numerical features, we now examine the distribution of `Education` and `Marital_Status` within each of the three customer segments, using the `df_profiling` DataFrame. The percentages represent the proportion of customers within each cluster belonging to that specific category.

**Cluster Sizes (Recap):**
* **Cluster 0:** 1110 customers
* **Cluster 1:** 1049 customers
* **Cluster 2:** 54 customers (a very small segment)

**A. Education Level Distribution:**

| Education  | Cluster 0 (%) | Cluster 1 (%) | Cluster 2 (%) |
|------------|---------------|---------------|---------------|
| 2n Cycle   | 8.20          | 10.20         | 0.00          |
| Basic      | 0.00          | 0.00          | 100.00        |
| Graduation | 53.78         | 49.48         | 0.00          |
| Master     | 15.32         | 18.59         | 0.00          |
| PhD        | 22.70         | 21.73         | 0.00          |

* **Cluster 0 ("Affluent & Engaged High-Value Customers"):**
    * Primarily "Graduation" (53.8%), with significant proportions having "PhD" (22.7%) and "Master" (15.3%). This indicates a highly educated segment. "2n Cycle" is a smaller group, and no one has "Basic" education.
* **Cluster 1 ("Middle-Income Families - Moderate Web Browsers"):**
    * Similar to Cluster 0, dominated by "Graduation" (49.5%), "PhD" (21.7%), and "Master" (18.6%). Also a highly educated group. "2n Cycle" is slightly higher than in Cluster 0. No "Basic" education.
* **Cluster 2 ("Youngest, Low-Income Deal-Seekers - Niche Size"):**
    * This is a striking finding: **100% of customers in this very small cluster have "Basic" education.** This is a powerful defining characteristic for this segment.

*Insight on Education:* Clusters 0 and 1 are both highly educated: about 92% of Cluster 0 and 90% of Cluster 1 hold a "Graduation", "Master", or "PhD" level. Cluster 2 is exclusively composed of individuals with a "Basic" education level, making it unique in this regard.

**B. Marital Status Distribution:**

*(Note: The categories "Absurd", "Alone", and "YOLO" appear with very small percentages. This indicates that the `Marital_Status` column in `df_profiling` retains the original detailed categories, rather than any consolidated "Other_Marital" group that may have been created when preprocessing `df_cluster`.)*

| Marital_Status | Cluster 0 (%) | Cluster 1 (%) | Cluster 2 (%) |
|----------------|---------------|---------------|---------------|
| Absurd         | 0.18          | 0.00          | 0.00          |
| Alone          | 0.00          | 0.29          | 0.00          |
| Divorced       | 10.81         | 10.49         | 1.85          |
| Married        | 37.93         | 39.66         | 37.04         |
| Single         | 20.90         | 20.97         | 33.33         |
| Together       | 25.68         | 26.02         | 25.93         |
| Widow          | 4.50          | 2.38          | 1.85          |
| YOLO           | 0.00          | 0.19          | 0.00          |

* **Cluster 0 ("Affluent & Engaged High-Value Customers"):**
    * Predominantly "Married" (37.9%) or "Together" (25.7%), totaling about 63.6% in a committed relationship. A notable portion is "Single" (20.9%), and "Divorced" (10.8%). Few "Widow" (4.5%).
* **Cluster 1 ("Middle-Income Families - Moderate Web Browsers"):**
    * Very similar to Cluster 0: primarily "Married" (39.7%) or "Together" (26.0%), totaling about 65.7%. "Single" (21.0%) and "Divorced" (10.5%) are also significant. Fewer "Widow" (2.4%).
* **Cluster 2 ("Youngest, Low-Income Deal-Seekers - Niche Size"):**
    * This cluster shows a different pattern. While "Married" is still the largest single group (37.0%), "Single" is notably higher (33.3%) compared to Clusters 0 and 1. "Together" is also present (25.9%). "Divorced" and "Widow" are much lower.
    * The rare categories "Absurd", "Alone", and "YOLO" do not appear in this cluster at all; they occur only in Clusters 0 and 1, at well under 1% each.

*Insight on Marital Status:* Clusters 0 and 1 have fairly similar marital status distributions, with a majority being married or together. Cluster 2 has a higher proportion of "Single" individuals compared to the other two, and very few "Divorced" or "Widow".
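
The rare labels "Absurd", "Alone", and "YOLO" add noise to these tables without adding insight. A minimal sketch of folding such categories into a single "Other_Marital" group for profiling is shown here. The helper name `consolidate_rare`, the 1% threshold, and the toy data are all assumptions for illustration.

```python
import pandas as pd

def consolidate_rare(series, min_share=0.01, other_label="Other_Marital"):
    """Replace categories whose overall share is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; substitute other_label for the rest
    return series.where(~series.isin(rare), other_label)

# Toy data: 200 customers, three of them with one-off labels
status = pd.Series(
    ["Married"] * 60 + ["Together"] * 30 + ["Single"] * 107 + ["Absurd", "Alone", "YOLO"]
)

cleaned = consolidate_rare(status)
print(cleaned.value_counts())
```

Applied to `df_profiling['Marital_Status']` before the grouping step, this would shorten the distribution table while leaving the main categories untouched.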

#### Updated Preliminary Segment Profiles (Integrating Categorical Insights):

Let's refine our segment personas with this new information:

* **Cluster 0: "Affluent, Highly Educated & Established Customers" (Largest Segment: 1110)**
    * **Demographics:** Highest income, oldest group, very few young children, moderate teens. **Highly educated (Graduation, PhD, Master).** Predominantly Married/Together.
    * **Spending & Behavior:** Highest spenders, high multi-channel purchase volume, low deal usage, low web visits, highly responsive to campaigns.
    * *Key Differentiators:* High income, high education, older, high spending, high campaign response.

* **Cluster 1: "Educated, Middle-Income Families - Cautious Online Browsers" (1049)**
    * **Demographics:** Low-Middle income, middle-aged, many young children and teens. **Highly educated (Graduation, PhD, Master).** Predominantly Married/Together.
    * **Spending & Behavior:** Low overall spending, low purchase volume, moderate deal usage, high web visits, moderately responsive to recent campaigns.
    * *Key Differentiators:* Highly educated, presence of children/teens, low-mid income, low spending, high web visits.

* **Cluster 2: "Youngest, Basic Education, Low-Income Deal Seekers" (Smallest & Niche Segment: 54)**
    * **Characteristics:** Youngest, lowest income, many young children but very few teens. **Exclusively "Basic" education level.** Higher proportion of "Single" individuals. Very low spenders, highest deal usage, highest web visits, least responsive to campaigns. No complaints.
    * **Key Differentiators:** Youngest, lowest income, **Basic education**, high `Kidhome`, very low spending, high deal usage, very low campaign response. The "Basic" education is a very strong and unique marker for this small group.

#### Visualizing Cluster Profiles with Plotly

To better understand and communicate the distinct characteristics of our three customer segments, we will now create interactive visualizations using Plotly. These plots will help illustrate the differences in numerical feature distributions and categorical feature compositions across the clusters, offering a more dynamic way to explore the data.

We will visualize:
1.  **Distributions of Key Numerical Features per Cluster**: Using box plots to compare features like `Income`, `Age`, key spending categories, and key purchase behaviors.
2.  **Cluster Composition within Each Categorical Value**: Using grouped bar charts to show, for each `Education` and `Marital_Status` category, the percentage of that category's customers that falls in each cluster.

In [43]:
if 'df_profiling' not in locals() and 'df_profiling' not in globals():
    print("Error: df_profiling is not defined. Please ensure previous steps were run correctly.")
    # Dummy df_profiling for code execution
    data_for_fallback_viz = {
        'Income': np.concatenate([np.random.randint(20000, 150000, 99), [666666]]),
        'Age': np.random.randint(25, 70, 100),
        'MntWines': np.random.randint(0, 1000, 100),
        'NumStorePurchases': np.random.randint(0,10,100),
        'Education': np.random.choice(['Graduation', 'PhD', 'Master', '2n Cycle', 'Basic'], 100),
        'Marital_Status': np.random.choice(['Married', 'Single', 'Together', 'Divorced', 'Widow', 'Other_Marital'], 100),
        'Cluster': np.random.choice([0,1,2], 100)
    }
    df_profiling = pd.DataFrame(data_for_fallback_viz)
else:
    print(f"Using df_profiling with shape: {df_profiling.shape} for visualizing cluster profiles with Plotly.")

# --- Define Custom Colors for Clusters ---
# For numerical plots where 'Cluster' column is integer
cluster_colors_map = {
    0: '#0B0055', # Dark Blue/Purple
    1: '#F86302', # Orange
    2: '#2DB762'  # Green
}
# For categorical plots where the 'Cluster' column is cast to string
cluster_cols_list = ['#0B0055', '#F86302', '#2DB762'] # Order for '0', '1', '2'


# --- 1. Visualizing Distributions of Key Numerical Features per Cluster (Horizontal, Custom Colors, Outlier Handling for Income) ---

numerical_features_for_viz = [
    'Income', 'Age',
    'MntWines', 'MntMeatProducts', 'MntGoldProds',
    'NumStorePurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumDealsPurchases',
    'Kidhome', 'Teenhome',
    'NumWebVisitsMonth', 'Response'
]

print("\nGenerating Plotly horizontal box plots for key numerical features by Cluster...")
for feature in numerical_features_for_viz:
    if feature in df_profiling.columns:
        plot_data_numerical = df_profiling.copy()

        title_suffix = ""
        if feature == 'Income':
            income_visualization_cap = 200000 # Cap for visualization
            original_max = plot_data_numerical[feature].max()
            plot_data_numerical = plot_data_numerical[plot_data_numerical[feature] < income_visualization_cap]
            if original_max >= income_visualization_cap:
                title_suffix = f"<br>(Displayed for Incomes < ${income_visualization_cap:,} to improve clarity)"

        fig_numerical = px.box(plot_data_numerical, x=feature, y='Cluster',
                               color='Cluster',
                               color_discrete_map=cluster_colors_map, # Using map for integer cluster labels
                               orientation='h',
                               title=f'<b>Distribution of {feature} by Cluster</b>{title_suffix}',
                               labels={'Cluster': 'Cluster', feature: feature},
                               points="outliers")

        fig_numerical.update_layout(
            title_x=0.5,
            yaxis=dict(showgrid=False, title_font=dict(size=12), tickfont=dict(size=10), type='category'),
            xaxis=dict(showgrid=False, title_font=dict(size=12), tickfont=dict(size=10)),
            font=dict(size=11),
            legend_title_text='Cluster'
        )
        fig_numerical.update_yaxes(categoryorder='array', categoryarray=sorted(plot_data_numerical['Cluster'].unique()))
        fig_numerical.show()
    else:
        print(f"Warning: Feature '{feature}' not found in df_profiling for numerical visualization.")

# --- 2. Visualizing Cluster Composition within Each Categorical Feature Value ---

print("\nGenerating Plotly bar charts for categorical features by Cluster (user's method)...")

df_plot_categorical = df_profiling.copy()
df_plot_categorical["Cluster"] = df_plot_categorical["Cluster"].astype(str) # Treat clusters as categories for Plotly Express sequence

cats_to_vis  = ['Education', 'Marital_Status']

for col in cats_to_vis:
    if col not in df_plot_categorical.columns:
        print(f"⚠️  {col} not found; skipping.")
        continue

    # Calculate % of each cluster *within* every category value of 'col'
    # This shows: for each Education level, what's the cluster breakdown?
    pct_data = (pd.crosstab(df_plot_categorical[col], df_plot_categorical["Cluster"], normalize="index") * 100).round(2)
    
    # Melt the data into a long format suitable for Plotly Express
    long_data = pct_data.reset_index().melt(id_vars=col,
                                            var_name="Cluster",
                                            value_name="Percent")

    # Build grouped-bar chart
    fig_categorical = px.bar(
        long_data,
        x=col, y="Percent",
        color="Cluster",
        color_discrete_sequence=cluster_cols_list, # Use the list for string cluster labels
        barmode="group",
        text="Percent", # Show values on bars
        title=f"<b>Cluster Composition within each '{col}' Category (%)</b>" # Title reflects the crosstab normalization
    )

    # Cosmetic tweaks: value labels, axis titles, spacing, and figure size
    fig_categorical.update_traces(texttemplate="%{text:.1f}%", textposition="outside", cliponaxis=False)
    fig_categorical.update_layout(
        title_x=0.5,
        yaxis_title="Percentage of Category's Customers (%)", # Clarified y-axis label
        xaxis_title=col,
        legend_title_text="Cluster",
        bargap=0.30,
        plot_bgcolor="white",
        uniformtext_minsize=8, uniformtext_mode="hide",
        height=500, # Adjusted height slightly
        width=900  # Adjusted width slightly
    )
    fig_categorical.show()

print("\n" + "="*50 + "\n")
print("Plotly visual profiling of clusters complete.")

Using df_profiling with shape: (2213, 26) for visualizing cluster profiles with Plotly.

Generating Plotly horizontal box plots for key numerical features by Cluster...



Generating Plotly bar charts for categorical features by Cluster (user's method)...




Plotly visual profiling of clusters complete.
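
The bar charts in this cell normalize within each category (`normalize="index"`), which answers a different question from the within-cluster proportions tabulated earlier. The sketch below contrasts the two normalizations on a tiny toy frame; the data are made up for illustration.

```python
import pandas as pd

# Toy data: six customers across three clusters
toy = pd.DataFrame({
    "Cluster": ["0", "0", "0", "1", "1", "2"],
    "Education": ["Graduation", "PhD", "Graduation", "Graduation", "Master", "Basic"],
})

# Rows sum to 100: for each Education level, how its customers split across clusters
within_category = pd.crosstab(toy["Education"], toy["Cluster"], normalize="index") * 100

# Columns sum to 100: for each cluster, its Education mix
within_cluster = pd.crosstab(toy["Education"], toy["Cluster"], normalize="columns") * 100

print(within_category.round(1))
print(within_cluster.round(1))
```

With within-category normalization, a small cluster can only show tall bars in categories it dominates, so the within-cluster view (`normalize="columns"`) is the one to use when the goal is to describe each segment's own mix.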


#### **Consolidating Insights: Final Segment Personas and Summary**

Now that we have explored the numerical and categorical characteristics of our three customer segments, both through statistical summaries and visualizations, it's time to bring all these insights together. The goal is to create clear, concise, and memorable personas for each segment.

This involves:

1.  **Reviewing All Evidence:** Look back at:
    * The mean values of numerical features (Income, Age, spending, purchase behavior, engagement, campaign response).
    * The distribution of categorical features (Education, Marital Status).
    * The key takeaways from the visualizations (box plots and bar charts).

2.  **Identifying Key Differentiating Characteristics for Each Cluster:** For each cluster, pinpoint the 3-5 most prominent and differentiating attributes that set it apart from the others.

3.  **Crafting a Narrative/Persona for Each Cluster:**
    * Give each cluster a descriptive and memorable name (e.g., "Affluent Achievers," "Budget-Conscious Families," "Young Potentials"). We had some preliminary ideas; let's refine them.
    * Write a short paragraph for each segment summarizing its key demographic profile, spending habits, purchasing preferences, and engagement with marketing campaigns.
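
Step 2 above can be supported with a simple ranking. The sketch below standardizes each cluster mean against the overall mean and standard deviation, then lists the features with the largest absolute deviation per cluster. It is an illustration on made-up data; the toy frame and the choice of two features per cluster are assumptions, not the method used to build the personas.

```python
import pandas as pd

# Toy data standing in for df_profiling
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2],
    "Income": [70000, 65000, 38000, 37000, 21000, 19000],
    "Kidhome": [0, 0, 1, 1, 1, 0],
    "Recency": [50, 48, 49, 49, 47, 50],
})
features = ["Income", "Kidhome", "Recency"]

# Cluster means expressed in overall standard deviations from the overall mean
means = toy.groupby("Cluster")[features].mean()
z_means = (means - toy[features].mean()) / toy[features].std()

# For each cluster, the features that deviate most from the overall average
top = {
    cluster: row.abs().sort_values(ascending=False).head(2).index.tolist()
    for cluster, row in z_means.iterrows()
}
print(z_means.round(2))
print(top)
```

Standardizing first puts features measured in dollars, counts, and days on one scale, so a large raw difference in `Income` does not automatically outrank a meaningful difference in `Kidhome`.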

### **Final Segment Personas & Summary**

Drawing from our comprehensive analysis of numerical averages, categorical distributions, and the insights gained from visualizations, we can now define distinct personas for our three customer segments. These personas will help in understanding the unique needs and behaviors of each group, enabling targeted business strategies.

---

#### **Segment 0: The Affluent Achievers**

* **Number of Customers:** 1110 (Largest Segment)
* **Persona Narrative:**
    "The Affluent Achievers" represent the most economically significant segment. Typically older (average age ~58 years) and highly educated (predominantly Graduation, PhD, or Master's degrees), these customers boast the highest average income (around \$67,600). Their households usually have very few young children but may include teenagers. They are established individuals, often married or in long-term relationships.
    This segment demonstrates a strong propensity for high-value purchases across all product categories, particularly excelling in spending on wines, meat products, and other premium items. They are active shoppers across multiple channels – web, catalog, and in-store – and make frequent purchases. Interestingly, while they are high-volume purchasers, they are the least likely to use deals, suggesting a lower price sensitivity. Despite making fewer monthly web visits compared to other segments, their conversion is high, and they are, by far, the most responsive group to marketing campaigns, showing high acceptance rates for past promotions and the latest offers. Their complaint rate is very low.
* **Key Differentiating Characteristics:**
    * Highest income and overall spending.
    * Older, highly educated.
    * High purchase volume across all channels (web, catalog, store).
    * Most responsive to marketing campaigns.
    * Low usage of deals.
    * Few young children at home.
* **Potential Marketing Angles & Business Considerations:**
    * Target with premium product offerings and loyalty programs.
    * Utilize a multi-channel marketing approach, including catalogs and personalized email campaigns highlighting new arrivals and exclusive offers.
    * Focus on value and quality over discounts.
    * Leverage their high campaign responsiveness for new product launches and upselling opportunities.

---

#### **Segment 1: The Family-Focused Navigators**

* **Number of Customers:** 1049
* **Persona Narrative:**
    "The Family-Focused Navigators" form a substantial segment characterized by households with children (both young kids and teenagers are common). They are typically middle-aged (average age ~54 years) with a low-to-middle average income (around \$37,600). Similar to the Affluent Achievers, this segment is also highly educated. Most are married or in a committed relationship.
    Their spending is considerably lower than Segment 0 across all product categories, reflecting a more budget-conscious approach. They make fewer purchases overall and are moderately inclined to use deals. A key behavior is their high frequency of monthly web visits, suggesting they might be active online researchers or browsers, possibly comparing prices or looking for specific family-oriented products. While their responsiveness to past marketing campaigns was low, they show a moderate engagement with the latest offers. Their complaint rate is slightly higher than other segments, though still very low overall.
* **Key Differentiating Characteristics:**
    * Presence of young children and teenagers.
    * Low-to-middle income.
    * Low overall spending.
    * High frequency of web visits.
    * Highly educated.
    * Moderate deal usage.
* **Potential Marketing Angles & Business Considerations:**
    * Focus on family-oriented products, value bundles, and promotions relevant to households with children.
    * Engage through online channels, given their high web visit frequency; content marketing and targeted digital ads could be effective.
    * Highlight value and utility in marketing messages.
    * Test campaigns that resonate with budget-conscious families.
    * Monitor feedback given their slightly higher (though still low) complaint rate.

---

#### **Segment 2: The Young & Thrifty Explorers**

* **Number of Customers:** 54 (Smallest & Niche Segment)
* **Persona Narrative:**
    "The Young & Thrifty Explorers" represent a small but distinct niche. This is the youngest segment (average age ~48 years) with the lowest average income (around \$20,300). A defining characteristic is that **100% of this segment has a "Basic" education level**. Their households often include young children but very few teenagers. They also have a higher proportion of "Single" individuals compared to other segments.
    Their spending is the lowest across almost all categories. While they don't purchase frequently through traditional channels, they are the most active users of deals. Like Segment 1, they visit the website often; in fact, they have the highest frequency of monthly web visits of any segment, indicating significant online activity, likely for browsing and deal hunting. This segment is largely unresponsive to most marketing campaigns and had no recorded complaints.
* **Key Differentiating Characteristics:**
    * Youngest age group.
    * Lowest income.
    * **Exclusively "Basic" education.**
    * Presence of young children, few teens.
    * Highest usage of deals.
    * Very low overall spending.
    * Highest web visit frequency.
    * Least responsive to marketing campaigns.
    * Very small segment size.
* **Potential Marketing Angles & Business Considerations:**
    * Given their strong preference for deals, promotions and deep discounts are likely the most effective way to engage them.
    * Target through online channels where they are most active.
    * Focus on entry-level products or loss-leaders to attract them.
    * The small size of this segment means highly targeted, low-cost digital campaigns might be appropriate if pursuing them.
    * Their unique "Basic" education profile might correlate with specific needs or communication style preferences.
    * Given their unresponsiveness to past campaigns, a different approach to marketing messaging may be needed for this group.

---

## **Phase 4: Business Implications, Recommendations, and Conclusion**

---

Having successfully identified and profiled three distinct customer segments, the next critical phase is to translate these data-driven insights into actionable business strategies. Understanding "who" our customers are allows us to think about "how" to best serve them and engage with them effectively.

This phase will focus on:

1.  **Brainstorming Strategic Recommendations for Each Segment:**
    * Based on the unique personas ("Affluent Achievers," "Family-Focused Navigators," and "Young & Thrifty Explorers"), what specific marketing strategies, product offerings, communication styles, and customer service approaches would be most effective for each?
    * How can the company leverage the strengths and address the preferences of each segment to improve customer satisfaction, loyalty, and lifetime value?

2.  **Identifying Business Opportunities:**
    * Do these segments reveal any untapped opportunities or underserved needs in the market?
    * How can the company use this segmentation to optimize marketing spend and improve ROI?

3.  **Formulating a Project Conclusion:**
    * Summarize the key findings of the customer segmentation analysis.
    * Reiterate the characteristics of the identified segments.

4.  **Suggesting Potential Future Work and Limitations:**
    * Briefly discuss any limitations of the current analysis (e.g., data constraints, algorithm choice).
    * Suggest potential future steps, such as:
        * A/B testing targeted strategies on these segments.
        * Incorporating additional data sources (e.g., detailed transaction history, psychographics if available) for richer profiles.
        * Exploring other clustering algorithms or predictive modeling to forecast segment migration.
        * Operationalizing the segmentation (e.g., flagging customers in a CRM system).

### **4.1. Strategic Recommendations for Segments**

#### **Segment 0: "The Affluent Achievers"**

This segment represents the most valuable customers in terms of spending and responsiveness to marketing. Strategies should focus on acknowledging their value, offering premium experiences, and fostering loyalty.

**1. Marketing & Communication:**

* **Messaging Style:**
    * Emphasize quality, exclusivity, premium features, and craftsmanship over price.
    * Highlight new arrivals, curated collections, and limited-edition products.
    * Use sophisticated language and visuals that appeal to their mature and educated status.
    * Acknowledge their loyalty and past purchases in communications.
* **Preferred Channels:**
    * **Direct Mail & Catalogs:** Given their high number of catalog purchases, invest in high-quality print catalogs featuring premium selections.
    * **Email Marketing:** Personalized emails showcasing products related to their past purchases (especially wines, high-quality meats, gourmet foods, gold items). Segment email lists to target them specifically with relevant offers.
    * **Personalized Web Experience:** Although they visit the web less frequently, ensure that when they do, the experience is seamless, perhaps with a curated section for "Top Tier Customers" or "New & Exclusive."
    * **Avoid Over-Messaging on Deals:** Since they use deals the least, bombarding them with discount-heavy communication might be ineffective or even devalue the brand in their eyes.
* **Campaign Focus:**
    * Continue to leverage their high responsiveness to campaigns (Cmp1, Cmp2, Cmp4, Cmp5, and latest `Response`).
    * Focus campaigns on new product launches, exclusive bundles, early access, or loyalty rewards rather than deep discounts.

**2. Product & Service Offerings:**

* **Premium & High-Quality Products:**
    * Ensure a strong offering of high-end wines, gourmet meat products, artisanal foods, and premium gold/jewelry items, as these are categories where they spend significantly.
    * Introduce or highlight exclusive product lines or brands not available to other segments.
* **Value-Added Services:**
    * Consider offering services like personal shopping assistance, curated subscription boxes (e.g., wine club, gourmet food box), or concierge services.
    * Offer enhanced customer support channels (e.g., dedicated phone line, priority service).
* **Loyalty Program:**
    * Implement a tiered loyalty program where this segment quickly reaches or starts at a top tier.
    * Rewards should focus on exclusivity, early access, special experiences, or premium gifts rather than just small discounts.

**3. Pricing & Promotions:**

* **Value-Based Pricing:** They are less price-sensitive; focus on the value and quality proposition.
* **Strategic Promotions:**
    * Instead of percentage-off deals, consider "gift with purchase" of premium items, "buy X get a related premium Y," or early access to sales.
    * Bundling complementary high-value items could be appealing.
    * Loyalty points that translate into significant rewards or exclusive products.
* **Avoid Deep Discounting:** This could erode perceived value for this segment.

**4. Customer Experience:**

* **Recognition & Personalization:** Make them feel valued. Use their purchase history to provide highly relevant recommendations.
* **Efficient Purchasing:** Since they purchase frequently across channels, ensure a smooth and efficient checkout process both online and in-store. For catalog orders, ensure order fulfillment is top-notch.
* **Build Relationships:** For very top spenders within this segment, consider more personalized outreach or account management if feasible.

**Overall Goal for Segment 0:** Maximize their lifetime value by fostering loyalty through premium offerings, exclusive experiences, and recognition, while leveraging their high responsiveness to well-crafted marketing campaigns that emphasize quality and status.

#### **Segment 1: "The Family-Focused Navigators"**

This large segment consists of educated, low-to-middle-income families who are generally cautious with their spending but are active online. Strategies should focus on providing value, convenience, and family-relevant offers, primarily through digital channels.

**1. Marketing & Communication:**

* **Messaging Style:**
    * Emphasize value, practicality, family benefits, and solutions for a busy household.
    * Highlight promotions, bundle deals, and budget-friendly options.
    * Use clear, straightforward language. Content that offers tips or solutions for families could resonate.
    * Acknowledge their household structure (presence of kids/teens) if possible in tailored communications.
* **Preferred Channels:**
    * **Digital Channels (Primary Focus):** Given their high number of monthly web visits, focus on:
        * **Website Content:** Ensure the website is easy to navigate, showcases value, and has clear product information. Family-related product categories should be prominent.
        * **Email Marketing:** Send regular emails featuring weekly deals, family meal ideas (if applicable to the company's products), promotions on kid/teen-related items, and value bundles.
        * **Social Media:** Engage on platforms popular with parents, showcasing user-generated content, family tips, or relevant promotions.
    * **Mobile App (if available):** An app with easy browsing, shopping lists, and exclusive app-only deals could be effective.
* **Campaign Focus:**
    * Their moderate response to the latest campaign (10%) shows some potential. Focus campaigns on:
        * **Value for Money:** "Buy X, Get Y free/discounted," percentage-off deals, or loyalty points for consistent purchases.
        * **Family Bundles:** Grouping products often bought by families at a slightly discounted overall price.
        * **Seasonal Needs:** Back-to-school, holiday deals relevant to families.
    * Test different offers to see what drives conversion, as their response to earlier campaigns was low.

**2. Product & Service Offerings:**

* **Value-Oriented & Family-Sized Products:**
    * Ensure a good selection of budget-friendly options and larger, family-sized packs for relevant product categories.
    * Highlight products that offer convenience for busy parents.
* **Products for Children & Teenagers:**
    * If inventory allows, promote or expand offerings relevant to their children's age groups.
* **Subscription Options (for essentials):**
    * Consider "subscribe and save" models for staple items this segment might regularly purchase, offering convenience and a small discount.
* **Easy Returns & Good Customer Service:**
    * With families, ease of returns can be important. Ensure a hassle-free process.
    * Given their slightly higher complaint rate, responsive and empathetic customer service is key.

**3. Pricing & Promotions:**

* **Price Sensitivity:** This segment is likely more price-sensitive due to lower-middle income and family expenses.
* **Active Promotion of Deals:** Clearly advertise deals and discounts through their preferred online channels.
* **Loyalty Program:** A points-based loyalty program that rewards regular, even if smaller, purchases and can be redeemed for discounts or practical family items could be appealing.
* **Tiered Discounts:** Offer discounts that increase with the quantity purchased for family staples.

**4. Customer Experience:**

* **Website/App Usability:** A user-friendly online experience is crucial given their high web visitation. Easy search, clear product categories, and efficient checkout are important.
* **Content Marketing:** Provide useful content related to family life, budgeting, or product uses that would appeal to this segment and keep them engaged with the brand online.
* **Gather Feedback:** Since they have a slightly higher (though still low) complaint rate, actively solicit feedback to understand any pain points and improve service.

**Overall Goal for Segment 1:** Nurture this large segment by offering consistent value, family-relevant products and promotions, and a convenient online experience. Focus on converting their high web engagement into sales by making it easy and affordable for them to shop for their families.

#### **Segment 2: "The Young & Thrifty Explorers"**

This is the smallest segment and a very distinct niche, characterized by its youth, lowest income, basic education level, and strong preference for deals. Their high web visit frequency, combined with low spending and low campaign responsiveness, suggests they are active online but highly price-sensitive or looking for very specific value.

**1. Marketing & Communication:**

* **Messaging Style:**
    * Emphasize **extreme value, deep discounts, and clear benefits for a tight budget.**
    * Use simple, direct, and easy-to-understand language, avoiding jargon.
    * Highlight how products can solve immediate needs or offer significant savings.
    * Visuals should be straightforward and clearly showcase the product and the deal.
* **Preferred Channels:**
    * **Digital Channels (Primary Focus):** Given their highest web visit frequency:
        * **Deal Aggregator Sites/Partnerships:** Consider listing specific deep-discount offers on external deal websites or apps that this demographic might frequent.
        * **Website - Dedicated Deals Section:** A very prominent "Clearance," "Deep Discounts," or "Value Buys" section on the website.
        * **Social Media (Targeted Ads):** Use social media platforms with ad targeting focused on demographics matching this segment (younger, budget-conscious, potentially parents of young children) and highlight specific deals.
        * **SMS/Mobile Alerts (with opt-in):** For flash sales or limited-time deep discounts, if an opt-in mechanism exists.
* **Campaign Focus:**
    * Their unresponsiveness to past broad campaigns suggests a need for a highly tailored approach.
    * **Focus exclusively on deal-based campaigns:** "X% off," "Buy One Get One Free (BOGO)" on very low-priced items, clearance sales.
    * **Limited-Time Offers:** Create urgency around deals.
    * Given their basic education, ensure campaign mechanics are extremely simple and easy to understand.

**2. Product & Service Offerings:**

* **Entry-Level & Basic Products:**
    * Offer a selection of essential, no-frills products at the lowest possible price points.
    * Consider "loss leaders" if the strategy is to attract them and hope for occasional small add-on purchases (though their overall spending is low).
* **Small Pack Sizes/Trial Sizes:**
    * Lower commitment and lower price point for trial.
* **Refurbished or Clearance Items:**
    * If applicable to the business, this could be a dedicated offering for this segment.

**3. Pricing & Promotions:**

* **Price Sensitivity:** Extremely high. Price is likely the primary driver.
* **Deep and Frequent Discounts:** This is key. They are motivated by significant savings (`NumDealsPurchases` is highest).
* **Clearance Sales:** Make these very visible.
* **Bundling (for extreme value):** Bundling very low-cost essentials might work if the perceived saving is substantial.

**4. Customer Experience:**

* **Simple and Fast Online Experience:** For their web visits, ensure the path to deals and checkout is extremely simple and quick.
* **Low-Cost Customer Service Options:** If service is needed, self-service options or community forums might be preferred over premium support channels.
* **No-Frills Approach:** This segment is unlikely to pay for or necessarily value elaborate packaging or extensive after-sales service unless it's related to product function for essential items.

**5. Strategic Consideration due to Small Size:**

* **Cost of Acquisition vs. Lifetime Value:** Given their very low spending and small segment size, carefully evaluate the cost of targeted marketing campaigns versus their potential lifetime value. It might be that this segment is best served opportunistically through general clearance activities rather than extensive dedicated campaigns.
* **Growth Potential:** Is this a segment the business wants to grow, or primarily manage for occasional, deal-driven sales? If growth is desired, the strategy might involve gradually introducing them to slightly higher-value items once initial trust and engagement via deals are established, but this would be challenging given their profile.
* **Understanding Web Visits:** Their high web visits despite low spending and campaign response warrants further investigation. Are they comparing prices extensively? Are they looking for information they don't find? Are they abandoning carts due to shipping costs or other final-stage hurdles? Website analytics could provide clues.
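
The acquisition-cost versus lifetime-value trade-off raised above can be made concrete with a back-of-the-envelope calculation. The sketch below uses a deliberately simple lifetime-value formula (average order value × orders per year × gross margin × expected years as a customer). The function names, the 3:1 value-to-cost target, and all input figures are illustrative assumptions, not values taken from the dataset.

```python
def simple_clv(avg_order_value, orders_per_year, gross_margin, expected_years):
    """Simplified customer lifetime value: margin earned over the expected relationship."""
    return avg_order_value * orders_per_year * gross_margin * expected_years


def max_affordable_cac(clv, target_ratio=3.0):
    """Largest acquisition cost that keeps CLV / CAC at or above target_ratio.

    The default 3:1 ratio is a common rule of thumb, used here as an assumption.
    """
    return clv / target_ratio


# Hypothetical inputs for a low-spend, deal-driven customer (not measured values).
clv_example = simple_clv(
    avg_order_value=20.0, orders_per_year=4, gross_margin=0.25, expected_years=2
)
cac_ceiling = max_affordable_cac(clv_example)

print(f"Illustrative CLV: {clv_example:.2f}")
print(f"Maximum affordable acquisition cost: {cac_ceiling:.2f}")
```

If the estimated cost of a dedicated campaign per acquired or reactivated customer exceeds the ceiling, serving the segment through general clearance activity is likely the more economical choice.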

**Overall Goal for Segment 2:** Capture opportunistic sales through deep discounts and clear value propositions, primarily via online channels. Manage acquisition costs carefully due to their low spending and small size. Understand their high web activity better to see if there are low-cost ways to improve conversion for deals.

### **4.2. Identifying Business Opportunities from Segmentation**

The identification of three distinct customer segments ("Affluent Achievers," "Family-Focused Navigators," and "Young & Thrifty Explorers") opens up several strategic business opportunities:

1.  **Enhanced Customer Lifetime Value (CLV) through Targeted Investment:**
    * **Opportunity:** By understanding the high spending power and campaign responsiveness of **Segment 0 ("Affluent Achievers")**, the business can strategically invest more in retaining and delighting this group. Their proven value justifies premium services, personalized attention, and loyalty programs designed to maximize their long-term engagement and spend.
    * *Actionable Insight:* Prioritize resources and marketing budget towards nurturing this segment for sustained high revenue.

2.  **Growth Potential in Under-Penetrated or Developing Segments:**
    * **Opportunity:** **Segment 1 ("Family-Focused Navigators")** is a large group with high web engagement but lower current spending. This signals an opportunity to improve conversion. Understanding their family needs and budget considerations can lead to developing more appealing value propositions (e.g., family bundles, relevant product lines for children/teens, loyalty programs rewarding consistent smaller purchases).
    * *Actionable Insight:* Develop targeted strategies to increase the average spend and purchase frequency of Segment 1 by better aligning offers with their needs and online behavior.

3.  **Optimized Marketing Spend and Improved ROI:**
    * **Opportunity:** Knowing that **Segment 0** is highly responsive to most campaigns (especially non-deal-based ones) and **Segment 2 ("Young & Thrifty Explorers")** is largely unresponsive (except to deep deals) allows for significant optimization of marketing spend. Money spent on broad, non-deal campaigns for Segment 2 is likely wasted.
    * *Actionable Insight:* Reallocate marketing budget away from ineffective campaigns for Segment 2 towards more tailored, high-ROI campaigns for Segment 0 and value-driven digital campaigns for Segment 1. Measure campaign effectiveness per segment.

4.  **New Product Development or Service Customization:**
    * **Opportunity:** The needs of families in **Segment 1** (kids and teens) and the distinct "Basic" education profile of **Segment 2** might highlight gaps or opportunities in the current product/service assortment. For example, are there specific products for families or entry-level/basic product versions that could be introduced or better promoted?
    * *Actionable Insight:* Conduct further research (e.g., surveys, focus groups) within Segment 1 and 2 to explore unmet needs that could lead to new product development or service modifications.

5.  **Improved Customer Acquisition Strategies:**
    * **Opportunity:** The detailed profile of the highly valuable **Segment 0** can be used to create look-alike audiences for new customer acquisition campaigns. Marketing efforts can be focused on attracting new customers who share similar demographic and behavioral traits with this profitable segment.
    * *Actionable Insight:* Refine acquisition marketing to target prospects resembling "Affluent Achievers."

6.  **Strategic Channel Development:**
    * **Opportunity:** **Segment 0** shows strong engagement with catalog and store purchases, suggesting these channels are vital for this high-value group. **Segments 1 and 2** are highly active online (high web visits), indicating that optimizing the website for conversion (especially for Segment 1) and providing clear pathways to deals (for Segment 2) is crucial.
    * *Actionable Insight:* Invest in maintaining high-quality catalog and in-store experiences for Segment 0, while heavily focusing on website UX, targeted content, and deal visibility for Segments 1 & 2.

7.  **Managing Low-Value or Niche Segments Effectively:**
    * **Opportunity:** For the very small **Segment 2 ("Young & Thrifty Explorers")**, which has low spending and low campaign responsiveness (except to deep deals), the opportunity lies in cost-effective management.
    * *Actionable Insight:* Serve this segment primarily through automated, low-cost channels (e.g., a "clearance" section on the website) and avoid expensive, broad marketing efforts. Focus on maximizing conversion for the deals they seek with minimal operational overhead.

8.  **Personalization at Scale:**
    * **Opportunity:** This segmentation provides a foundational framework for moving towards more personalized customer experiences. Even simple rule-based personalization (e.g., different website landing pages or email streams per segment) can be implemented.
    * *Actionable Insight:* Explore CRM and marketing automation capabilities to deliver segment-specific content and offers.
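
As an illustration of the simple rule-based personalization mentioned in the last item, a lookup from segment label to experience settings is often enough to start. This is a minimal sketch: the landing-page paths, email stream names, and the `experience_for` helper are hypothetical placeholders, not part of any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentExperience:
    landing_page: str   # hypothetical site path
    email_stream: str   # hypothetical email stream identifier


# Hypothetical mapping from K-Means segment label to experience settings.
EXPERIENCE_BY_SEGMENT = {
    0: SegmentExperience("/new-and-exclusive", "premium_launches"),   # Affluent Achievers
    1: SegmentExperience("/family-value", "weekly_family_deals"),     # Family-Focused Navigators
    2: SegmentExperience("/clearance", "flash_deals"),                # Young & Thrifty Explorers
}

# Fallback for customers that have not been assigned a segment yet.
DEFAULT_EXPERIENCE = SegmentExperience("/", "general_newsletter")


def experience_for(segment_id):
    """Return the experience settings for a segment, or the default if unknown."""
    return EXPERIENCE_BY_SEGMENT.get(segment_id, DEFAULT_EXPERIENCE)


print(experience_for(0))
print(experience_for(None))
```

Keeping the rules in a single table makes them easy to review with marketing stakeholders and to replace later with a CRM or marketing automation configuration.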

By focusing on these opportunities, the business can make more informed decisions, allocate resources more effectively, and build stronger relationships with each customer segment based on their unique profiles and needs.

### **4.3. Project Conclusion**

This project aimed to segment the retail company's customer base to gain deeper insights into their diverse profiles, purchasing behaviors, and engagement patterns. The ultimate goal was to provide a foundation for developing tailored marketing strategies, enhancing customer satisfaction, and optimizing business decisions to improve revenue and customer loyalty.

**Methodology Overview:**
The project involved a comprehensive data science workflow:
1.  **Data Exploration and Preparation:** The dataset was meticulously cleaned, with missing values handled, outliers addressed (e.g., for `Age`), and new features engineered (e.g., `Age` from `Year_Birth`). Categorical features were appropriately encoded, and skewed numerical features were log-transformed to prepare the data for robust modeling. All features intended for clustering were then standardized.
2.  **Exploratory Data Analysis (EDA):** In-depth univariate and bivariate analyses were conducted to understand the underlying patterns, distributions, and relationships within the data.
3.  **Optimal Cluster Determination:** Various statistical methods, including the Elbow Method (WCSS/Inertia), Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index, were employed. Based on a consolidated view of these metrics, **three (k=3)** was determined to be the optimal number of customer segments.
4.  **K-Means Clustering:** The K-Means algorithm was applied to the prepared dataset with `n_clusters=3`.
5.  **Segment Profiling and Visualization:** Each of the three resulting clusters was thoroughly profiled by analyzing the mean values of numerical features and the distribution of categorical features. These profiles were further illuminated through various visualizations, including interactive box plots and bar charts (using Plotly), and a PCA plot to visualize cluster separation in a reduced dimensional space.
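
The cluster-count comparison in step 3 can be sketched as follows. This is an illustrative outline rather than the notebook's exact code: it uses synthetic data from `make_blobs` as a stand-in for the prepared customer features, and the helper name `evaluate_k_range` is hypothetical. The metrics themselves are the standard scikit-learn implementations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared customer features (NOT the project data).
X_raw, _ = make_blobs(n_samples=600, centers=3, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X_raw)


def evaluate_k_range(X, k_values, random_state=42):
    """Fit K-Means for each candidate k and collect the four selection metrics."""
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        results.append(
            {
                "k": k,
                # Within-cluster sum of squares, used for the elbow plot.
                "inertia": model.inertia_,
                # Higher is better (range -1 to 1).
                "silhouette": silhouette_score(X, labels),
                # Lower is better.
                "davies_bouldin": davies_bouldin_score(X, labels),
                # Higher is better.
                "calinski_harabasz": calinski_harabasz_score(X, labels),
            }
        )
    return results


metrics_by_k = evaluate_k_range(X_scaled, range(2, 7))
for row in metrics_by_k:
    print(row)
```

Because the four metrics do not always agree, reading them side by side for each k (as in the consolidated view described above) is more reliable than relying on any single one.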

**Key Findings: Customer Segments Identified**

The analysis successfully identified three distinct customer segments, each with unique characteristics:

1.  **Segment 0: "The Affluent Achievers" (Approx. 50% of customers, N=1110)**
    * **Profile:** This segment represents the most economically valuable group. They are characterized by the highest average income, are generally older (average age ~58), and are highly educated. Their households typically have few young children.
    * **Behavior:** They are high-volume spenders across all product categories and purchase channels (web, catalog, store). Notably, they are the most responsive to marketing campaigns but use deals the least, indicating lower price sensitivity.

2.  **Segment 1: "The Family-Focused Navigators" (Approx. 47% of customers, N=1049)**
    * **Profile:** This large segment consists of middle-aged individuals (average age ~54) with low-to-middle incomes, often with young children and teenagers in the household. They are also highly educated.
    * **Behavior:** Their spending is cautious and significantly lower than Segment 0. They are active online (high web visits), use deals moderately, and show moderate engagement with recent marketing campaigns.

3.  **Segment 2: "The Young & Thrifty Explorers" (Approx. 2.4% of customers, N=54)**
    * **Profile:** This is a small, niche segment representing the youngest customers (average age ~48) with the lowest incomes. A striking characteristic is their "Basic" education level. They often have young children but few teens, and a higher proportion are single.
    * **Behavior:** They are very low spenders but are the most frequent users of deals. They exhibit high web visit frequency (suggesting online research or deal hunting) but are the least responsive to marketing campaigns. This group had no recorded complaints.
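
The approximate shares quoted above follow directly from the segment counts. The short check below assumes that the three segments together cover every clustered customer, so the total is simply the sum of the three counts.

```python
# Segment sizes as reported in the findings above.
segment_counts = {0: 1110, 1: 1049, 2: 54}

# Assumption: the three segments account for all clustered customers.
total_customers = sum(segment_counts.values())

# Percentage share of each segment, rounded to one decimal place.
segment_shares = {
    segment: round(100 * count / total_customers, 1)
    for segment, count in segment_counts.items()
}

print(total_customers)
print(segment_shares)
```

Under that assumption the total is 2,213 customers, with shares of roughly 50.2%, 47.4%, and 2.4%.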

**Overall Business Value and Implications:**
This segmentation provides a clear, data-driven framework for understanding the heterogeneous nature of the company's customer base. The distinct profiles of "The Affluent Achievers," "The Family-Focused Navigators," and "The Young & Thrifty Explorers" enable the business to move beyond one-size-fits-all strategies.

Key implications include the ability to:
* **Tailor Marketing Messages and Offers:** Craft distinct communications and promotions that resonate with the specific motivations, needs, and price sensitivities of each segment.
* **Optimize Marketing Channel Strategy:** Allocate resources to the channels most effective for reaching and engaging each segment (e.g., catalogs for Segment 0, enhanced online engagement for Segments 1 & 2).
* **Personalize Product Recommendations and Service Offerings:** Develop or highlight products and services that align with the demographic and lifestyle characteristics of each group.
* **Improve Customer Relationship Management:** Foster loyalty among high-value segments (like Segment 0) through targeted programs, while developing cost-effective strategies for engaging other segments.
* **Identify Growth and Efficiency Opportunities:** Pinpoint segments with growth potential (e.g., converting Segment 1's web visits into sales) and areas where marketing spend can be made more efficient (e.g., with Segment 2).

In conclusion, this customer segmentation project has yielded actionable insights that can empower the retail company to make more strategic, customer-centric decisions, ultimately aiming to enhance customer engagement, increase revenue, and build stronger, more profitable customer relationships.

### **4.4. Future Work**

While this customer segmentation analysis has yielded valuable insights and distinct customer personas, it's important to consider areas for future development to build upon these findings.

**Potential Future Work:**

1.  **A/B Testing of Segment-Specific Strategies:**
    * **Action:** Design and execute A/B tests for the marketing messages, offers, and channel strategies tailored for each identified segment ("Affluent Achievers," "Family-Focused Navigators," "Young & Thrifty Explorers").
    * **Benefit:** Quantitatively measure the effectiveness of these targeted approaches in terms of conversion rates, average order value, and ROI.

2.  **Developing a Predictive Segmentation Model:**
    * **Action:** Train a classification model (e.g., Logistic Regression, Decision Tree, Random Forest, Gradient Boosting) using the cluster labels as the target variable and the original customer features (before scaling for K-Means) as predictors.
    * **Benefit:** This would allow the company to automatically assign new customers to one of these segments as they onboard, enabling immediate personalization.

3.  **Deeper Investigation of Segment 2 ("Young & Thrifty Explorers"):**
    * **Action:** Given its small size and unique characteristics (especially 100% "Basic" education and high deal sensitivity), conduct qualitative research (e.g., surveys, short interviews if possible) with a sample from this segment to better understand their motivations, needs, and specific pain points.
    * **Benefit:** Validate the persona and determine if this segment warrants dedicated (but low-cost) strategies or if its members might be better served through broader discount policies.

4.  **Exploring Alternative Clustering Algorithms:**
    * **Action:** Experiment with other clustering techniques, such as Hierarchical Clustering (to explore different numbers of clusters and dendrogram structures), DBSCAN (which can find arbitrarily shaped clusters and identify noise/outliers), or Gaussian Mixture Models (which are probabilistic).
    * **Benefit:** See if alternative algorithms yield different, potentially more nuanced, or more robust segmentations.

5.  **Incorporating Additional Data Sources:**
    * **Action:** If feasible, integrate richer data sources into the analysis, such as:
        * Detailed transaction data (specific products purchased, basket size, purchase frequency per product category).
        * Website clickstream data (pages visited, time spent, paths taken).
        * Customer service interaction logs.
        * Social media data (if ethically sourced and relevant).
    * **Benefit:** Create even more detailed and behaviorally rich customer profiles.

6.  **Operationalizing and Monitoring Segments:**
    * **Action:** Implement a system to tag customers with their segment label in the company's CRM or marketing automation platform. Establish KPIs to monitor the size, spending patterns, and engagement of each segment over time.
    * **Benefit:** Enables ongoing targeted marketing and allows for tracking segment evolution and the effectiveness of strategies. Re-run the segmentation periodically (e.g., annually or semi-annually) to ensure its continued relevance.

7.  **Analyzing Customer Journey and Segment Migration:**
    * **Action:** Over time, track if and how customers move between segments.
    * **Benefit:** Understand drivers of segment migration and develop strategies to encourage movement towards more valuable segments or prevent churn from key segments.
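
Items 2 and 7 above can be combined into one small workflow: train a classifier on the cluster labels, use it to re-score customers at a later date, and tabulate how many moved between segments. The sketch below is illustrative only. It runs on synthetic data from `make_blobs`, the "later snapshot" is simulated by adding random noise, and the choice of a random forest is an assumption rather than a recommendation from the analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the original (unscaled) customer features.
X_raw, _ = make_blobs(n_samples=600, centers=3, n_features=5, random_state=0)

# Segment labels from K-Means on standardized features, mirroring the analysis.
X_scaled = StandardScaler().fit_transform(X_raw)
segment_prev = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Item 2: a classifier that maps unscaled features to the segment labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, segment_prev, test_size=0.25, stratify=segment_prev, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
# Accuracy here measures agreement with the K-Means labels, not business value.
holdout_accuracy = clf.score(X_test, y_test)

# Item 7: re-score the same customers at a later snapshot (simulated with noise).
rng = np.random.default_rng(0)
X_later = X_raw + rng.normal(scale=1.5, size=X_raw.shape)
segment_curr = clf.predict(X_later)

# Row-normalized migration matrix: share of each previous segment ending up in each current one.
migration = pd.crosstab(
    pd.Series(segment_prev, name="segment_prev"),
    pd.Series(segment_curr, name="segment_curr"),
    normalize="index",
)

print(f"Hold-out agreement with K-Means labels: {holdout_accuracy:.3f}")
print(migration.round(3))
```

Each row of the migration matrix sums to 1, so the diagonal shows the share of customers who stayed in their segment and the off-diagonal cells show where the rest moved.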

By pursuing this future work, the company can continuously refine its understanding of its customers and enhance its ability to engage with them effectively.