# Customer Segmentation - Exploratory Data Analysis (EDA)

This notebook focuses on exploring the preprocessed customer behavior data to gain insights before applying clustering algorithms. We'll analyze:

1. Feature distributions
2. Correlation between features
3. Feature relationships
4. Dimensionality reduction using PCA
5. Initial visual exploration for potential clusters

In [1]:
# Use the %pip magic so plotly is installed into the environment of the running kernel
%pip install plotly

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import plotly.express as px
import os

# Set plotting style
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Increase default figure size
plt.rcParams['figure.figsize'] = [12, 8]

# Create output directory if it doesn't exist
os.makedirs('./output', exist_ok=True)

ModuleNotFoundError: No module named 'plotly'

## 1. Load Preprocessed Data

In [None]:
# Load the preprocessed data
preprocessed_file = './output/preprocessed_data.csv'
df = pd.read_csv(preprocessed_file)

# Display basic information
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Also load the unscaled data for reference
unscaled_file = './output/cleaned_data_unscaled.csv'
df_unscaled = pd.read_csv(unscaled_file)

# Display unscaled data
print("Unscaled data (first few rows):")
df_unscaled.head()

In [None]:
# Check data types and basic info
df.info()

In [None]:
# Get statistical summary of preprocessed data
df.describe().T

## 2. Feature Distributions

In [None]:
# Select relevant columns for analysis (exclude customer_id if present)
feature_cols = [col for col in df.columns if col != 'customer_id']

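# Plot histograms of each feature
n_rows = int(np.ceil(len(feature_cols) / 2))  # two plots per row, as many rows as needed
plt.figure(figsize=(16, 4 * n_rows))

for i, col in enumerate(feature_cols):
    plt.subplot(n_rows, 2, i+1)
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')

plt.tight_layout()  # apply once, after all subplots exist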

plt.savefig('./output/feature_distributions.png', dpi=300)
plt.show()

In [None]:
# Plot distributions of unscaled features for better interpretation
unscaled_feature_cols = [col for col in df_unscaled.columns if col != 'customer_id']

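n_rows = int(np.ceil(len(unscaled_feature_cols) / 2))  # two plots per row
plt.figure(figsize=(16, 4 * n_rows))

for i, col in enumerate(unscaled_feature_cols):
    plt.subplot(n_rows, 2, i+1)
    sns.histplot(df_unscaled[col], kde=True)
    plt.title(f'Original Distribution of {col}')

plt.tight_layout()  # apply once, after all subplots exist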

plt.savefig('./output/original_feature_distributions.png', dpi=300)
plt.show()
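
The histograms above show distribution shape visually; a numeric skewness value per feature can complement them. The following is a minimal sketch, not part of the original analysis. The helper `skewness_summary` and the `toy` DataFrame are hypothetical stand-ins; in this notebook the function would be called on `df_unscaled[unscaled_feature_cols]`.

```python
import pandas as pd


def skewness_summary(frame: pd.DataFrame) -> pd.Series:
    """Return the sample skewness of each numeric column, most skewed first."""
    skew = frame.select_dtypes(include="number").skew()
    # Order by absolute skewness so strongly left- or right-skewed features come first
    order = skew.abs().sort_values(ascending=False).index
    return skew.reindex(order)


# Hypothetical toy data standing in for df_unscaled
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 20.0],
})

summary = skewness_summary(toy)
print(summary)
```

Values near zero suggest a roughly symmetric distribution, while large positive values point to a long right tail, which may matter for distance-based clustering.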

In [None]:
# Create box plots to visualize feature distributions
n_rows = int(np.ceil(len(feature_cols) / 2))  # two plots per row, as many rows as needed
plt.figure(figsize=(14, 10))

for i, col in enumerate(feature_cols):
    plt.subplot(n_rows, 2, i+1)
    sns.boxplot(x=df[col])
    plt.title(f'Box Plot of {col}')

plt.tight_layout()
plt.savefig('./output/feature_boxplots.png', dpi=300)
plt.show()
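
The box plots mark points beyond 1.5 times the interquartile range (seaborn's default whisker length) as outliers. As a rough sketch, the same rule can be applied numerically to count flagged points per feature. The helper `iqr_outlier_counts` and the `toy_box` data are hypothetical and only illustrate the rule; they are not results from this dataset.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons between a DataFrame and a Series align on column names
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical toy data: column "a" contains one extreme value
toy_box = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10, 11, 12, 13, 14],
})

counts = iqr_outlier_counts(toy_box)
print(counts)
```

In this notebook the call would be `iqr_outlier_counts(df[feature_cols])`, assuming all feature columns are numeric.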

## 3. Correlation Analysis

In [None]:
# Create correlation matrix
corr_matrix = df[feature_cols].corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.savefig('./output/correlation_heatmap.png', dpi=300)
plt.show()

In [None]:
# Identify high correlations
high_corr = []
for i, row in enumerate(corr_matrix.values):
    for j, corr in enumerate(row):
        if i < j and abs(corr) > 0.5:  # Upper triangle only; keep strong correlations (|r| > 0.5)
            high_corr.append((corr_matrix.index[i], corr_matrix.columns[j], corr))

if high_corr:
    print("Features with high correlation (|r| > 0.5):")
    for feature1, feature2, corr in high_corr:
        print(f"{feature1} and {feature2}: {corr:.3f}")
else:
    print("No high correlations found between features.")
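
The nested loop above can also be written without explicit Python loops. The sketch below uses `np.triu_indices` to select the upper triangle of the correlation matrix and returns the strong pairs as a sorted table. The function name `strong_pairs` and the `toy_corr` data are hypothetical; the 0.5 threshold mirrors the cell above.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return feature pairs with |r| above the threshold, strongest first."""
    # Indices of the strict upper triangle (k=1 skips the diagonal)
    rows, cols = np.triu_indices(len(corr), k=1)
    values = corr.to_numpy()[rows, cols]
    keep = np.abs(values) > threshold
    result = pd.DataFrame({
        "feature_1": corr.index.to_numpy()[rows[keep]],
        "feature_2": corr.columns.to_numpy()[cols[keep]],
        "r": values[keep],
    })
    return result.sort_values("r", key=np.abs, ascending=False).reset_index(drop=True)


# Hypothetical toy data: y is an exact multiple of x, z is uncorrelated with both
toy_corr = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [1, -1, 1, -1, 1],
}).corr()

result = strong_pairs(toy_corr)
print(result)
```

Here the call would be `strong_pairs(corr_matrix)`. Returning a DataFrame makes it easy to save the pairs alongside the other EDA outputs.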

## 4. Feature Relationships

In [None]:
# Create pairplot to visualize relationships between features
sns.pairplot(df[feature_cols], diag_kind='kde')
plt.suptitle("Pairwise Feature Relationships", y=1.02, fontsize=16)
plt.tight_layout()
plt.savefig('./output/pairplot.png', dpi=300)
plt.show()

In [None]:
# Create some specific scatter plots based on domain knowledge
plt.figure(figsize=(14, 10))

# Plot 1: total_purchases vs avg_cart_value
plt.subplot(2, 2, 1)
sns.scatterplot(x='total_purchases', y='avg_cart_value', data=df_unscaled, alpha=0.7)
plt.title('Total Purchases vs Average Cart Value')
plt.xlabel('Total Purchases')
plt.ylabel('Average Cart Value')

# Plot 2: total_time_spent vs product_click
plt.subplot(2, 2, 2)
sns.scatterplot(x='total_time_spent', y='product_click', data=df_unscaled, alpha=0.7)
plt.title('Time Spent vs Product Clicks')
plt.xlabel('Total Time Spent (minutes)')
plt.ylabel('Product Clicks')

# Plot 3: discount usage vs avg_cart_value
discount_col = 'discount_counts' if 'discount_counts' in df_unscaled.columns else 'discount_count'
plt.subplot(2, 2, 3)
sns.scatterplot(x=discount_col, y='avg_cart_value', data=df_unscaled, alpha=0.7)
plt.title('Discount Usage vs Average Cart Value')
plt.xlabel('Discount Usage Count')
plt.ylabel('Average Cart Value')

# Plot 4: total_purchases vs discount usage
plt.subplot(2, 2, 4)
sns.scatterplot(x='total_purchases', y=discount_col, data=df_unscaled, alpha=0.7)
plt.title('Total Purchases vs Discount Usage')
plt.xlabel('Total Purchases')
plt.ylabel('Discount Usage Count')

plt.tight_layout()
plt.savefig('./output/key_feature_relationships.png', dpi=300)
plt.show()

## 5. Dimensionality Reduction with PCA

In [None]:
# Calculate PCA
pca = PCA()
principal_components = pca.fit_transform(df[feature_cols])

# Calculate explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Print explained variance
print("Explained variance ratio by component:")
for i, ratio in enumerate(explained_variance):
    print(f"PC{i+1}: {ratio:.4f} ({cumulative_variance[i]:.4f} cumulative)")
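
A common follow-up to the cumulative variance printout is to ask how many components are needed to reach a chosen variance target. The sketch below shows one way to compute that. The function `components_for_variance`, the 90% target, and the `ratios` array are illustrative assumptions, not values from this dataset.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, target=0.90):
    """Return the smallest number of components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # First index whose cumulative value is >= target; +1 converts an index to a count
    idx = int(np.searchsorted(cumulative, target, side="left"))
    # Cap at the total number of components in case rounding keeps the sum below target
    return min(idx + 1, len(cumulative))


# Hypothetical explained variance ratios for four components
ratios = np.array([0.5, 0.3, 0.15, 0.05])

n_components_90 = components_for_variance(ratios, target=0.90)
n_components_99 = components_for_variance(ratios, target=0.99)
print(n_components_90, n_components_99)
```

In this notebook the input would be `pca.explained_variance_ratio_`, which is already stored in `explained_variance`.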

In [None]:
# Plot explained variance
plt.figure(figsize=(10, 6))

# Individual explained variance
plt.bar(range(1, len(explained_variance) + 1), 
        explained_variance, 
        alpha=0.7, 
        label='Individual explained variance')

# Cumulative explained variance
plt.step(range(1, len(cumulative_variance) + 1), 
         cumulative_variance, 
         where='mid', 
         label='Cumulative explained variance',
         color='red')

plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('PCA Explained Variance')
plt.grid(linestyle='--', alpha=0.5)
plt.legend()
plt.tight_layout()
plt.savefig('./output/pca_explained_variance.png', dpi=300)
plt.show()

In [None]:
# Create a dataframe with principal components
pca_df = pd.DataFrame(
    data=principal_components[:, :2],
    columns=['PC1', 'PC2']
)

# Add customer_id if it exists in the original dataframe
if 'customer_id' in df.columns:
    pca_df['customer_id'] = df['customer_id'].values

# Visualize the data in 2D PCA space
plt.figure(figsize=(10, 8))
plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.5)
plt.title('PCA: First Two Principal Components')
plt.xlabel(f'PC1 ({explained_variance[0]:.2%} variance)')
plt.ylabel(f'PC2 ({explained_variance[1]:.2%} variance)')
plt.grid(linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('./output/pca_visualization.png', dpi=300)
plt.show()

In [None]:
# Create an interactive 3D scatter plot with the first three principal components
if len(explained_variance) >= 3:
    pca_3d_df = pd.DataFrame(
        data=principal_components[:, :3],
        columns=['PC1', 'PC2', 'PC3']
    )
    
    fig = px.scatter_3d(
        pca_3d_df, 
        x='PC1', 
        y='PC2', 
        z='PC3',
        title='PCA: First Three Principal Components',
        opacity=0.7,
        labels={
            'PC1': f'PC1 ({explained_variance[0]:.2%})',
            'PC2': f'PC2 ({explained_variance[1]:.2%})',
            'PC3': f'PC3 ({explained_variance[2]:.2%})'
        }
    )
    fig.update_traces(marker=dict(size=5))
    fig.write_html('./output/pca_3d_visualization.html')
    fig.show()

In [None]:
# Compute and visualize feature loadings: each component's eigenvector scaled by
# the square root of its explained variance
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
loading_df = pd.DataFrame(
    loadings, 
    columns=[f'PC{i+1}' for i in range(len(pca.components_))],
    index=feature_cols
)

print("PCA Feature Loadings:")
print(loading_df)

# Plot heatmap of feature loadings
plt.figure(figsize=(10, 8))
sns.heatmap(loading_df, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('PCA Feature Loadings')
plt.tight_layout()
plt.savefig('./output/pca_loadings.png', dpi=300)
plt.show()
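
To read the loadings heatmap more easily, it can help to list the features with the largest absolute loading on each component. The following is a small sketch of that idea. The helper `top_features_per_component` is hypothetical, and the numbers in `toy_loadings` are made up for illustration; they are not loadings from this dataset.

```python
import pandas as pd


def top_features_per_component(loading_df: pd.DataFrame, n: int = 2) -> dict:
    """Map each component to its n features with the largest absolute loading."""
    return {
        pc: loading_df[pc].abs().nlargest(n).index.tolist()
        for pc in loading_df.columns
    }


# Made-up loadings for illustration only
toy_loadings = pd.DataFrame(
    {
        "PC1": [0.9, -0.7, 0.1],
        "PC2": [0.2, 0.3, -0.95],
    },
    index=["total_purchases", "avg_cart_value", "total_time_spent"],
)

top = top_features_per_component(toy_loadings, n=2)
print(top)
```

Taking the absolute value matters because a strongly negative loading is as informative as a strongly positive one. In this notebook the call would be `top_features_per_component(loading_df)`.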

## 6. Initial Clustering Observations

In [None]:
# Scatter plots of key feature pairs, colored by a third feature
# Color gradients that separate into regions can hint at potential customer segments

# Example 1: Color by total_purchases
plt.figure(figsize=(12, 10))
plt.scatter(df_unscaled['avg_cart_value'], df_unscaled['total_time_spent'], 
            c=df_unscaled['total_purchases'], cmap='viridis', alpha=0.7, s=50)
plt.colorbar(label='Total Purchases')
plt.title('Potential Customer Segments by Total Purchases')
plt.xlabel('Average Cart Value')
plt.ylabel('Total Time Spent')
plt.grid(linestyle='--', alpha=0.5)
plt.savefig('./output/potential_segments_purchases.png', dpi=300)
plt.show()

In [None]:
# Example 2: Color by discount usage
discount_col = 'discount_counts' if 'discount_counts' in df_unscaled.columns else 'discount_count'
plt.figure(figsize=(12, 10))
plt.scatter(df_unscaled['avg_cart_value'], df_unscaled['total_purchases'], 
            c=df_unscaled[discount_col], cmap='viridis', alpha=0.7, s=50)
plt.colorbar(label='Discount Usage')
plt.title('Potential Customer Segments by Discount Usage')
plt.xlabel('Average Cart Value')
plt.ylabel('Total Purchases')
plt.grid(linestyle='--', alpha=0.5)
plt.savefig('./output/potential_segments_discounts.png', dpi=300)
plt.show()

In [None]:
# Example 3: Color by product clicks
plt.figure(figsize=(12, 10))
plt.scatter(df_unscaled['total_time_spent'], df_unscaled['total_purchases'], 
            c=df_unscaled['product_click'], cmap='viridis', alpha=0.7, s=50)
plt.colorbar(label='Product Clicks')
plt.title('Potential Customer Segments by Product Clicks')
plt.xlabel('Total Time Spent')
plt.ylabel('Total Purchases')
plt.grid(linestyle='--', alpha=0.5)
plt.savefig('./output/potential_segments_clicks.png', dpi=300)
plt.show()

## 7. Save Processed Data for Modeling

In [None]:
# Save PCA results
pca_df.to_csv('./output/pca_results.csv', index=False)

# Save feature loadings
loading_df.to_csv('./output/pca_feature_loadings.csv')

print("EDA results saved successfully!")

## Summary of EDA Findings

From our exploratory data analysis, we can observe:

1. **Feature Distributions**: 
   - Feature distributions vary in shape, and some features show potential outliers
   - Skewed or multi-peaked distributions may indicate the presence of different customer segments

2. **Correlations**:
   - We identified relationships between key features, such as time spent and product clicks
   - Discount usage and purchasing behavior show some of the correlations we expected

3. **PCA Results**:
   - The first two or three principal components capture a large share of the variance (the explained variance output in Section 5 gives the exact ratios)
   - Visual inspection of the PCA plot suggests potential clusters

4. **Potential Segments**:
   - The colored scatter plots in Section 6 suggest the presence of distinct customer segments
   - The patterns appear to align with the expected segments (Bargain Hunters, High Spenders, Window Shoppers)

These insights will guide our clustering approach in the next notebook.