# Week 7: Advanced EDA with Business Intelligence - Part 1: Customer Behavior Analysis and Segmentation

## Learning Objectives
By the end of this session, you will be able to:
- Conduct comprehensive customer behavior analysis using advanced EDA techniques
- Implement data-driven customer segmentation strategies
- Apply RFM (Recency, Frequency, Monetary) analysis for customer insights
- Create customer journey and lifecycle analysis
- Build actionable customer personas based on behavioral data

## Business Context
Today we dive deep into **customer behavior analytics** using our live Olist e-commerce dataset. Understanding customer patterns is crucial for:
- **Customer Retention**: Identifying at-risk customers
- **Revenue Optimization**: Finding high-value customer segments
- **Marketing Strategy**: Personalizing campaigns based on behavior
- **Product Strategy**: Understanding purchase patterns and preferences

**Key Business Questions:**
- Who are our most valuable customers and what drives their behavior?
- How can we segment customers for targeted marketing?
- What patterns exist in customer purchase journeys?
- Which customers are at risk of churning?

## 1. Environment Setup and Secure Data Connection

In [None]:
# Essential imports for customer behavior analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Advanced analytics libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy import stats
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Visualization enhancement
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

# Database connection (secure)
import os
from sqlalchemy import create_engine

# Display and plotting settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 8)

print("✅ Environment setup complete for customer behavior analysis!")

In [None]:
# Secure Database Connection Using Environment Variables
# NOTE: In production, always use environment variables for credentials

# For Google Colab users: Set environment variables
# You can set these in your Colab environment or use Colab secrets
import os

# Set up environment variables (replace with your actual credentials)
# In production, these should be set at the system level
if 'SUPABASE_DB_HOST' not in os.environ:
    # Temporary setup for educational purposes only
    # NEVER hardcode credentials in production code!
    os.environ['SUPABASE_DB_HOST'] = 'aws-0-us-east-1.pooler.supabase.com'
    os.environ['SUPABASE_DB_PORT'] = '6543'
    os.environ['SUPABASE_DB_NAME'] = 'postgres'
    os.environ['SUPABASE_DB_USER'] = 'postgres.pzykoxdiwsyclwfqfiii'
    os.environ['SUPABASE_DB_PASSWORD'] = 'L3tMeQuery123!'

# Construct database URL from environment variables
DATABASE_URL = f"postgresql://{os.environ['SUPABASE_DB_USER']}:{os.environ['SUPABASE_DB_PASSWORD']}@{os.environ['SUPABASE_DB_HOST']}:{os.environ['SUPABASE_DB_PORT']}/{os.environ['SUPABASE_DB_NAME']}"

# Create secure database engine
engine = create_engine(DATABASE_URL)

# Test connection
# SQLAlchemy 2.x requires raw SQL strings to be wrapped in text()
from sqlalchemy import text

try:
    with engine.connect() as conn:
        result = conn.execute(text("SELECT 1 AS connection_test"))
        print("✅ Secure database connection established!")
except Exception as e:
    print(f"❌ Connection failed: {e}")

print("\n🔒 Security Note: Database credentials loaded from environment variables")
print("   This is the secure way to handle sensitive connection information.")
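
One caveat on the connection string built above: interpolating the password directly into the URL breaks if it contains reserved characters such as `@`, `/`, or `:`. Below is a minimal sketch of a safer construction, assuming the same `SUPABASE_DB_*` variable names used in this notebook. The helper name `build_database_url` and the demo values are hypothetical.

```python
import os
from urllib.parse import quote


def build_database_url(env=os.environ):
    """Build a PostgreSQL URL from SUPABASE_DB_* variables, escaping user and password."""
    required = ['SUPABASE_DB_USER', 'SUPABASE_DB_PASSWORD', 'SUPABASE_DB_HOST',
                'SUPABASE_DB_PORT', 'SUPABASE_DB_NAME']
    missing = [name for name in required if name not in env]
    if missing:
        # Fail loudly instead of falling back to hardcoded credentials
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    # Percent-encode every reserved character in the user name and password
    user = quote(env['SUPABASE_DB_USER'], safe='')
    password = quote(env['SUPABASE_DB_PASSWORD'], safe='')
    return (f"postgresql://{user}:{password}@{env['SUPABASE_DB_HOST']}"
            f":{env['SUPABASE_DB_PORT']}/{env['SUPABASE_DB_NAME']}")


# Demo with made-up values (not real credentials)
demo_env = {
    'SUPABASE_DB_USER': 'demo.user',
    'SUPABASE_DB_PASSWORD': 'p@ss/word',
    'SUPABASE_DB_HOST': 'db.example.com',
    'SUPABASE_DB_PORT': '5432',
    'SUPABASE_DB_NAME': 'postgres',
}
demo_url = build_database_url(demo_env)
print(demo_url)
```

SQLAlchemy's `URL.create` is another option, since it handles the escaping itself.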

## 2. Customer Data Loading and Preparation

Let's load comprehensive customer data for behavioral analysis.

In [None]:
# Comprehensive Customer Behavior Dataset
print("🔄 Loading comprehensive customer behavior dataset...")

# Customer transaction and behavior query
customer_behavior_query = """
WITH customer_orders AS (
    SELECT 
        c.customer_unique_id,
        c.customer_state,
        c.customer_city,
        o.order_id,
        o.order_purchase_timestamp,
        o.order_delivered_customer_date,
        EXTRACT(YEAR FROM o.order_purchase_timestamp) as order_year,
        EXTRACT(MONTH FROM o.order_purchase_timestamp) as order_month,
        EXTRACT(DOW FROM o.order_purchase_timestamp) as order_dow,
        EXTRACT(HOUR FROM o.order_purchase_timestamp) as order_hour,
        DATE_PART('day', o.order_delivered_customer_date - o.order_purchase_timestamp) as delivery_days
    FROM olist_sales_data_set.olist_customers_dataset c
    JOIN olist_sales_data_set.olist_orders_dataset o ON c.customer_id = o.customer_id
    WHERE o.order_status = 'delivered'
    AND o.order_delivered_customer_date IS NOT NULL
),
order_financials AS (
    SELECT 
        co.*,
        oi.product_id,
        oi.price,
        oi.freight_value,
        (oi.price + oi.freight_value) as total_order_value,
        p.product_category_name,
        COALESCE(pt.product_category_name_english, p.product_category_name) as category_english,
        r.review_score,
        CASE 
            WHEN r.review_score >= 4 THEN 'Satisfied'
            WHEN r.review_score = 3 THEN 'Neutral'
            WHEN r.review_score <= 2 THEN 'Dissatisfied'
            ELSE 'No Review'
        END as satisfaction_level
    FROM customer_orders co
    JOIN olist_sales_data_set.olist_order_items_dataset oi ON co.order_id = oi.order_id
    JOIN olist_sales_data_set.olist_products_dataset p ON oi.product_id = p.product_id
    LEFT JOIN olist_sales_data_set.product_category_name_translation pt 
        ON p.product_category_name = pt.product_category_name
    LEFT JOIN olist_sales_data_set.olist_order_reviews_dataset r ON co.order_id = r.order_id
    WHERE oi.price > 0
)
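-- NOTE: LIMIT without ORDER BY returns an arbitrary subset of item rows, so some
-- customers' order histories may be cut short (understating Frequency and Monetary)
SELECT * FROM order_financials
LIMIT 25000;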
"""

# Load the data
customer_df = pd.read_sql(customer_behavior_query, engine)

# Data preprocessing
customer_df['order_purchase_timestamp'] = pd.to_datetime(customer_df['order_purchase_timestamp'])
customer_df['order_delivered_customer_date'] = pd.to_datetime(customer_df['order_delivered_customer_date'])
customer_df['category_clean'] = customer_df['category_english'].fillna('Unknown').str.title()

# Calculate analysis period
analysis_end_date = customer_df['order_purchase_timestamp'].max()
analysis_start_date = customer_df['order_purchase_timestamp'].min()

print(f"✅ Customer behavior dataset loaded successfully!")
print(f"   📊 Total records: {len(customer_df):,}")
print(f"   👥 Unique customers: {customer_df['customer_unique_id'].nunique():,}")
print(f"   📦 Unique orders: {customer_df['order_id'].nunique():,}")
print(f"   📅 Analysis period: {analysis_start_date.date()} to {analysis_end_date.date()}")
print(f"   🏷️ Product categories: {customer_df['category_clean'].nunique()}")

# Display sample data
print("\n📋 Sample Customer Behavior Data:")
display(customer_df[['customer_unique_id', 'order_purchase_timestamp', 'category_clean', 
                   'total_order_value', 'review_score', 'satisfaction_level']].head())
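
A caution about the query above: it joins order items to reviews on `order_id`, so any order with more than one review row would have its item rows repeated, which inflates sums of `total_order_value`. The sketch below uses made-up data to show the effect and one possible fix, keeping a single review per order before joining. The toy frames and the choice to keep the last review are assumptions.

```python
import pandas as pd

# Toy data: order o1 has two items AND two review rows
items = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2'],
    'price': [10.0, 20.0, 5.0],
})
reviews = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2'],
    'review_score': [5, 4, 3],
})

# Naive join: each o1 item is repeated once per review row
naive = items.merge(reviews, on='order_id', how='left')
naive_total = naive['price'].sum()

# Keep one review per order first; validate= raises if duplicates remain
one_review_per_order = reviews.drop_duplicates(subset='order_id', keep='last')
deduped = items.merge(one_review_per_order, on='order_id', how='left',
                      validate='many_to_one')
deduped_total = deduped['price'].sum()

print(f"Rows: naive={len(naive)}, deduplicated={len(deduped)}")
print(f"Revenue: naive={naive_total}, deduplicated={deduped_total}")
```

Comparing `len(customer_df)` against the number of distinct `(order_id, product_id)` rows you expect is a quick way to check whether the loaded data is affected.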

## 3. RFM Analysis - Customer Value Segmentation

RFM (Recency, Frequency, Monetary) analysis is a foundational technique for customer segmentation.
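
As a quick illustration before the full implementation, here is RFM computed on a toy transaction table. The data and column names are made up; the logic (days since last purchase, distinct orders, total spend) mirrors the definitions used in this section.

```python
import pandas as pd

# Made-up transactions: one row per order
orders = pd.DataFrame({
    'customer': ['A', 'A', 'B', 'C', 'C', 'C'],
    'order_id': ['a1', 'a2', 'b1', 'c1', 'c2', 'c3'],
    'order_date': pd.to_datetime(['2018-01-01', '2018-03-01', '2018-02-15',
                                  '2018-01-10', '2018-02-10', '2018-03-10']),
    'amount': [50.0, 70.0, 200.0, 20.0, 30.0, 25.0],
})

# Recency is measured against the latest purchase date in the data
reference_date = orders['order_date'].max()

toy_rfm = orders.groupby('customer').agg(
    Recency=('order_date', lambda d: (reference_date - d.max()).days),
    Frequency=('order_id', 'nunique'),
    Monetary=('amount', 'sum'),
)
print(toy_rfm)
```

Customer C bought most recently and most often but spent the least, while B spent the most in a single order. This is why the three dimensions are scored separately.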

In [None]:
# RFM Analysis Implementation
print("📊 Implementing RFM (Recency, Frequency, Monetary) Analysis")
print("=" * 60)

def calculate_rfm_metrics(data, customer_id_col, date_col, monetary_col, order_id_col='order_id'):
    """
    Calculate RFM metrics for customer segmentation
    
    Parameters:
    -----------
    data : pd.DataFrame
        Customer transaction data (one row per order item)
    customer_id_col : str
        Name of customer ID column
    date_col : str
        Name of transaction date column
    monetary_col : str
        Name of monetary value column
    order_id_col : str, default 'order_id'
        Name of order ID column, used to count distinct orders
    
    Returns:
    --------
    pd.DataFrame
        One row per customer with Recency, Frequency, Monetary,
        Avg_Order_Value and Total_Items
    """
    # Reference date for recency calculation (latest date in dataset)
    reference_date = data[date_col].max()
    
    # Calculate RFM metrics with named aggregation
    rfm = data.groupby(customer_id_col).agg(
        Recency=(date_col, lambda x: (reference_date - x.max()).days),
        Frequency=(order_id_col, 'nunique'),
        Monetary=(monetary_col, 'sum'),
        Total_Items=(order_id_col, 'size'),
    ).reset_index()
    
    # Each input row is an order item, so the average order value is
    # total spend divided by the number of distinct orders
    rfm['Avg_Order_Value'] = rfm['Monetary'] / rfm['Frequency']
    
    return rfm[[customer_id_col, 'Recency', 'Frequency', 'Monetary',
                'Avg_Order_Value', 'Total_Items']]

# Calculate RFM metrics
rfm_data = calculate_rfm_metrics(
    customer_df, 
    'customer_unique_id', 
    'order_purchase_timestamp', 
    'total_order_value'
)

print(f"📈 RFM Analysis Results:")
print(f"   • Customers analyzed: {len(rfm_data):,}")
print(f"   • Average recency: {rfm_data['Recency'].mean():.1f} days")
print(f"   • Average frequency: {rfm_data['Frequency'].mean():.1f} orders")
print(f"   • Average monetary value: R$ {rfm_data['Monetary'].mean():.2f}")

# Display RFM statistics
print("\n📊 RFM Distribution Statistics:")
display(rfm_data[['Recency', 'Frequency', 'Monetary', 'Avg_Order_Value']].describe())

# RFM Scoring (1-5 scale)
def assign_rfm_scores(rfm_df):
    """
    Assign RFM scores using quintile-based scoring
    """
    rfm_scored = rfm_df.copy()
    
    # Recency Score (lower recency = higher score)
    rfm_scored['R_Score'] = pd.qcut(rfm_scored['Recency'], 5, labels=[5,4,3,2,1])
    
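    # Frequency Score (higher frequency = higher score)
    # Ranking with method='first' avoids duplicate bin edges when most customers have
    # a single order, but it spreads tied customers across scores by row order
    rfm_scored['F_Score'] = pd.qcut(rfm_scored['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])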
    
    # Monetary Score (higher monetary = higher score)
    rfm_scored['M_Score'] = pd.qcut(rfm_scored['Monetary'], 5, labels=[1,2,3,4,5])
    
    # Convert to numeric
    rfm_scored['R_Score'] = rfm_scored['R_Score'].astype(int)
    rfm_scored['F_Score'] = rfm_scored['F_Score'].astype(int)
    rfm_scored['M_Score'] = rfm_scored['M_Score'].astype(int)
    
    # Create RFM Score combination
    rfm_scored['RFM_Score'] = (
        rfm_scored['R_Score'].astype(str) + 
        rfm_scored['F_Score'].astype(str) + 
        rfm_scored['M_Score'].astype(str)
    )
    
    # Calculate overall score
    rfm_scored['RFM_Score_Numeric'] = (
        rfm_scored['R_Score'] + rfm_scored['F_Score'] + rfm_scored['M_Score']
    )
    
    return rfm_scored

# Apply RFM scoring
rfm_scored = assign_rfm_scores(rfm_data)

print("\n🎯 RFM Scoring Complete!")
print("\n📋 Sample RFM Scores:")
display(rfm_scored[['customer_unique_id', 'Recency', 'Frequency', 'Monetary', 
                  'R_Score', 'F_Score', 'M_Score', 'RFM_Score']].head(10))
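
A note on tie handling in the frequency score: when most customers have placed a single order, ranking with `method='first'` forces them into different quintiles even though their behavior is identical. The sketch below shows the effect on made-up data and contrasts it with one possible alternative, fixed thresholds. The alternative is an assumption for illustration, not the method this notebook uses.

```python
import pandas as pd

# Made-up order counts: eight one-time buyers and two repeat buyers
frequency = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 3], name='Frequency')

# Same approach as F_Score above: rank first, then cut the ranks into quintiles
rank_based = pd.qcut(frequency.rank(method='first'), 5,
                     labels=[1, 2, 3, 4, 5]).astype(int)

# Alternative (assumption): score = number of orders, capped at 5
threshold_based = frequency.clip(upper=5)

comparison = pd.DataFrame({
    'Frequency': frequency,
    'rank_based': rank_based,
    'threshold_based': threshold_based,
})
print(comparison)
```

With rank-based scoring the eight one-time buyers receive scores from 1 to 4 purely by row order, so any segment rule that depends on `F_Score` should be read with that in mind.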

In [None]:
# RFM Customer Segmentation
print("🎯 RFM-Based Customer Segmentation")
print("=" * 40)

def create_rfm_segments(rfm_df):
    """
    Create customer segments based on RFM scores
    """
    def segment_customers(row):
        r, f, m = row['R_Score'], row['F_Score'], row['M_Score']
        
        # Champions: high R, F and M
        if r >= 4 and f >= 4 and m >= 4:
            return 'Champions'
        
        # Loyal Customers: high R and F, lower M
        elif r >= 4 and f >= 4:
            return 'Loyal Customers'
        
        # Potential Loyalists: high R, lower F
        elif r >= 4:
            return 'Potential Loyalists'
        
        # Cannot Lose Them: low R but high F and M
        # (checked before the broader rules below, which would otherwise absorb these customers)
        elif r <= 2 and f >= 4 and m >= 4:
            return 'Cannot Lose Them'
        
        # Recent Customers: medium R (R = 3), low F
        elif r >= 3 and f <= 2:
            return 'Recent Customers'
        
        # Promising: R of 2-3 with F and M of at least 2
        elif r >= 2 and f >= 2 and m >= 2:
            return 'Promising'
        
        # Customers Needing Attention: lowest R with F and M of at least 3
        elif r <= 2 and f >= 3 and m >= 3:
            return 'Customers Needing Attention'
        
        # About to Sleep: low R and F, any M
        elif r <= 2 and f <= 2:
            return 'About to Sleep'
        
        # At Risk: low R, F of at least 3, low M
        elif r <= 2 and f >= 2:
            return 'At Risk'
        
        # Hibernating: all remaining combinations
        else:
            return 'Hibernating'
    
    rfm_df['Customer_Segment'] = rfm_df.apply(segment_customers, axis=1)
    return rfm_df

# Apply segmentation
rfm_segmented = create_rfm_segments(rfm_scored)

# Analyze segments
segment_analysis = rfm_segmented.groupby('Customer_Segment').agg({
    'customer_unique_id': 'count',
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': ['mean', 'sum'],
    'Avg_Order_Value': 'mean'
}).round(2)

# Flatten column names
segment_analysis.columns = ['Count', 'Avg_Recency', 'Avg_Frequency', 
                           'Avg_Monetary', 'Total_Revenue', 'Avg_Order_Value']

# Calculate percentages
segment_analysis['Percentage'] = (segment_analysis['Count'] / len(rfm_segmented) * 100).round(1)

# Sort by revenue contribution
segment_analysis = segment_analysis.sort_values('Total_Revenue', ascending=False)

print("\n🏆 Customer Segment Analysis:")
display(segment_analysis)

# Business insights
total_customers = len(rfm_segmented)
total_revenue = segment_analysis['Total_Revenue'].sum()
top_segment = segment_analysis.index[0]
top_segment_revenue_pct = (segment_analysis.loc[top_segment, 'Total_Revenue'] / total_revenue * 100)

print(f"\n💡 Key Segment Insights:")
print(f"   • Total customers analyzed: {total_customers:,}")
print(f"   • Most valuable segment: {top_segment} ({segment_analysis.loc[top_segment, 'Percentage']:.1f}% of customers)")
print(f"   • {top_segment} contributes {top_segment_revenue_pct:.1f}% of total revenue")
print(f"   • Average order value of {top_segment}: R$ {segment_analysis.loc[top_segment, 'Avg_Order_Value']:.2f}")

# At-risk analysis
at_risk_segments = ['At Risk', 'About to Sleep', 'Hibernating', 'Cannot Lose Them']
at_risk_customers = segment_analysis.loc[segment_analysis.index.isin(at_risk_segments), 'Count'].sum()
at_risk_percentage = (at_risk_customers / total_customers * 100)

print(f"\n⚠️ Customer Retention Alert:")
print(f"   • At-risk customers: {at_risk_customers:,} ({at_risk_percentage:.1f}% of customer base)")
print(f"   • These segments require immediate attention for retention")

In [None]:
# RFM Visualization Dashboard
print("📊 Creating RFM Analysis Visualizations")
print("=" * 40)

# Create comprehensive RFM visualization
fig = plt.figure(figsize=(20, 15))

# 1. Customer Segment Distribution
plt.subplot(2, 3, 1)
segment_counts = rfm_segmented['Customer_Segment'].value_counts()
colors = plt.cm.Set3(np.linspace(0, 1, len(segment_counts)))
plt.pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%', 
        colors=colors, startangle=90)
plt.title('Customer Segment Distribution', fontsize=14, fontweight='bold')

# 2. Revenue by Segment
plt.subplot(2, 3, 2)
segment_revenue = segment_analysis.sort_values('Total_Revenue', ascending=True)
plt.barh(range(len(segment_revenue)), segment_revenue['Total_Revenue'], color='lightcoral')
plt.yticks(range(len(segment_revenue)), segment_revenue.index)
plt.xlabel('Total Revenue (R$)')
plt.title('Revenue Contribution by Segment', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# 3. RFM Score Distribution
plt.subplot(2, 3, 3)
plt.hist(rfm_segmented['RFM_Score_Numeric'], bins=15, color='skyblue', alpha=0.7, edgecolor='black')
plt.axvline(rfm_segmented['RFM_Score_Numeric'].mean(), color='red', linestyle='--', 
           label=f'Mean: {rfm_segmented["RFM_Score_Numeric"].mean():.1f}')
plt.xlabel('RFM Score (Sum)')
plt.ylabel('Number of Customers')
plt.title('Distribution of RFM Scores', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

# 4. RFM Heatmap
plt.subplot(2, 3, 4)
rfm_summary = rfm_segmented.groupby(['R_Score', 'F_Score'])['Monetary'].mean().reset_index()
rfm_pivot = rfm_summary.pivot(index='F_Score', columns='R_Score', values='Monetary')
sns.heatmap(rfm_pivot, annot=True, fmt='.0f', cmap='YlOrRd', cbar_kws={'label': 'Avg Monetary Value'})
plt.title('RFM Heatmap: Average Monetary Value', fontsize=14, fontweight='bold')
plt.ylabel('Frequency Score')
plt.xlabel('Recency Score')

# 5. Recency vs Frequency Scatter
plt.subplot(2, 3, 5)
scatter = plt.scatter(rfm_segmented['Recency'], rfm_segmented['Frequency'], 
                     c=rfm_segmented['Monetary'], cmap='viridis', alpha=0.6, s=50)
plt.colorbar(scatter, label='Monetary Value (R$)')
plt.xlabel('Recency (Days)')
plt.ylabel('Frequency (Orders)')
plt.title('Customer Distribution: Recency vs Frequency', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# 6. Segment Performance Metrics
plt.subplot(2, 3, 6)
# Rank by average order value so the data matches the chart title
top_segments = segment_analysis.nlargest(5, 'Avg_Order_Value')
x_pos = range(len(top_segments))
plt.bar(x_pos, top_segments['Avg_Order_Value'], color='lightgreen', alpha=0.7)
plt.xticks(x_pos, top_segments.index, rotation=45, ha='right')
plt.ylabel('Average Order Value (R$)')
plt.title('Top 5 Segments by Avg Order Value', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Interactive RFM Analysis with Plotly
print("\n🎨 Creating Interactive RFM Dashboard...")

# Create interactive 3D scatter plot
fig_3d = px.scatter_3d(
    rfm_segmented, 
    x='Recency', 
    y='Frequency', 
    z='Monetary',
    color='Customer_Segment',
    size='Avg_Order_Value',
    hover_data=['RFM_Score'],
    title='Interactive 3D RFM Analysis',
    width=900,
    height=600
)

fig_3d.update_layout(
    scene=dict(
        xaxis_title='Recency (Days)',
        yaxis_title='Frequency (Orders)',
        zaxis_title='Monetary Value (R$)'
    )
)

fig_3d.show()

print("✅ RFM Analysis visualizations complete!")

## 4. Customer Journey and Lifecycle Analysis

This section examines how customers evolve over time and how their purchasing patterns change across lifecycle stages.

In [None]:
# Customer Journey and Lifecycle Analysis
print("🛤️ Customer Journey and Lifecycle Analysis")
print("=" * 45)

def analyze_customer_lifecycle(data):
    """
    Analyze customer lifecycle patterns and journey stages
    """
    # Customer lifecycle metrics
    # Rows are order items, so orders are counted as distinct order IDs
    customer_lifecycle = data.groupby('customer_unique_id').agg(
        first_order_date=('order_purchase_timestamp', 'min'),
        last_order_date=('order_purchase_timestamp', 'max'),
        total_orders=('order_id', 'nunique'),
        total_spent=('total_order_value', 'sum'),
        avg_order_value=('total_order_value', 'mean'),
        order_value_std=('total_order_value', 'std'),
        categories_purchased=('category_clean', 'nunique'),
        avg_delivery_days=('delivery_days', 'mean'),
        avg_review_score=('review_score', 'mean'),
    ).reset_index()
    
    # Calculate customer lifetime metrics
    customer_lifecycle['customer_lifespan_days'] = (
        customer_lifecycle['last_order_date'] - customer_lifecycle['first_order_date']
    ).dt.days
    
    # Calculate time since last order (churn risk indicator)
    reference_date = data['order_purchase_timestamp'].max()
    customer_lifecycle['days_since_last_order'] = (
        reference_date - customer_lifecycle['last_order_date']
    ).dt.days
    
    # Customer maturity stages
    def assign_lifecycle_stage(row):
        days_since_last = row['days_since_last_order']
        total_orders = row['total_orders']
        
        if total_orders == 1:
            if days_since_last <= 30:
                return 'New Customer'
            elif days_since_last <= 90:
                return 'One-time Buyer'
            else:
                return 'Lost New Customer'
        
        elif total_orders <= 3:
            if days_since_last <= 60:
                return 'Developing Customer'
            else:
                return 'At Risk'
        
        elif total_orders <= 7:
            if days_since_last <= 90:
                return 'Regular Customer'
            else:
                return 'Declining Customer'
        
        else:
            if days_since_last <= 120:
                return 'VIP Customer'
            else:
                return 'VIP at Risk'
    
    customer_lifecycle['lifecycle_stage'] = customer_lifecycle.apply(assign_lifecycle_stage, axis=1)
    
    return customer_lifecycle

# Perform lifecycle analysis
lifecycle_data = analyze_customer_lifecycle(customer_df)

print(f"📊 Customer Lifecycle Analysis Results:")
print(f"   • Customers analyzed: {len(lifecycle_data):,}")
print(f"   • Average customer lifespan: {lifecycle_data['customer_lifespan_days'].mean():.1f} days")
print(f"   • Average orders per customer: {lifecycle_data['total_orders'].mean():.1f}")
print(f"   • Average total spent per customer: R$ {lifecycle_data['total_spent'].mean():.2f}")

# Lifecycle stage distribution
stage_distribution = lifecycle_data['lifecycle_stage'].value_counts()
print(f"\n🎯 Customer Lifecycle Stage Distribution:")
for stage, count in stage_distribution.items():
    percentage = (count / len(lifecycle_data)) * 100
    print(f"   • {stage}: {count:,} customers ({percentage:.1f}%)")

# Lifecycle stage performance metrics
stage_performance = lifecycle_data.groupby('lifecycle_stage').agg({
    'total_orders': 'mean',
    'total_spent': 'mean',
    'avg_order_value': 'mean',
    'categories_purchased': 'mean',
    'avg_review_score': 'mean',
    'customer_lifespan_days': 'mean',
    'days_since_last_order': 'mean'
}).round(2)

print(f"\n📈 Lifecycle Stage Performance Metrics:")
display(stage_performance)

In [None]:
# Purchase Pattern and Behavior Analysis
print("🛒 Purchase Pattern and Behavior Analysis")
print("=" * 45)

# Seasonal and temporal purchase patterns
def analyze_purchase_patterns(data):
    """
    Analyze detailed purchase patterns and behaviors
    """
    # Create comprehensive purchase pattern analysis
    pattern_analysis = {
        'temporal': {},
        'category': {},
        'geographic': {},
        'behavioral': {}
    }
    
    # Temporal patterns
    pattern_analysis['temporal']['monthly'] = data.groupby('order_month').agg({
        'total_order_value': ['count', 'mean', 'sum'],
        'review_score': 'mean'
    }).round(2)
    
    pattern_analysis['temporal']['hourly'] = data.groupby('order_hour').agg({
        'total_order_value': ['count', 'mean'],
        'customer_unique_id': 'nunique'
    }).round(2)
    
    pattern_analysis['temporal']['daily'] = data.groupby('order_dow').agg({
        'total_order_value': ['count', 'mean'],
        'customer_unique_id': 'nunique'
    }).round(2)
    
    # Category preferences by customer segment (from RFM)
    customer_segments = rfm_segmented[['customer_unique_id', 'Customer_Segment']]
    data_with_segments = data.merge(customer_segments, on='customer_unique_id', how='left')
    
    pattern_analysis['category']['by_segment'] = data_with_segments.groupby(
        ['Customer_Segment', 'category_clean']
    ).agg({
        'total_order_value': ['count', 'sum', 'mean']
    }).round(2)
    
    # Geographic spending patterns
    pattern_analysis['geographic']['by_state'] = data.groupby('customer_state').agg({
        'total_order_value': ['count', 'mean', 'sum'],
        'customer_unique_id': 'nunique',
        'review_score': 'mean'
    }).round(2)
    
    return pattern_analysis, data_with_segments

# Perform pattern analysis
patterns, customer_data_segmented = analyze_purchase_patterns(customer_df)

# Display key temporal patterns
print("⏰ Temporal Purchase Patterns:")

# Monthly patterns
monthly_data = patterns['temporal']['monthly']
monthly_data.columns = ['Order_Count', 'Avg_Order_Value', 'Total_Revenue', 'Avg_Review']
print("\n📅 Monthly Patterns:")
display(monthly_data)

# Find peak periods
peak_month = monthly_data['Order_Count'].idxmax()
peak_hour = patterns['temporal']['hourly'][('total_order_value', 'count')].idxmax()
peak_day = patterns['temporal']['daily'][('total_order_value', 'count')].idxmax()

# PostgreSQL's EXTRACT(DOW ...) numbers days from 0 = Sunday to 6 = Saturday.
# The extracted values may arrive as float or Decimal, so cast to int before indexing.
day_names = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
peak_day_name = day_names[int(peak_day)]

print(f"\n📊 Peak Activity Periods:")
print(f"   • Peak month: {int(peak_month)} ({monthly_data.loc[peak_month, 'Order_Count']:,} orders)")
print(f"   • Peak hour: {int(peak_hour)}:00 ({patterns['temporal']['hourly'].loc[peak_hour, ('total_order_value', 'count')]:,} orders)")
print(f"   • Peak day: {peak_day_name} ({patterns['temporal']['daily'].loc[peak_day, ('total_order_value', 'count')]:,} orders)")

# Category preferences by customer segment
print(f"\n🏷️ Category Preferences by Customer Segment:")
top_categories_by_segment = {}

for segment in customer_data_segmented['Customer_Segment'].unique():
    if pd.notna(segment):
        segment_data = customer_data_segmented[customer_data_segmented['Customer_Segment'] == segment]
        top_categories = segment_data['category_clean'].value_counts().head(3)
        top_categories_by_segment[segment] = top_categories.to_dict()
        
        print(f"\n   {segment}:")
        for i, (category, count) in enumerate(top_categories.items(), 1):
            percentage = (count / len(segment_data)) * 100
            print(f"     {i}. {category}: {count} purchases ({percentage:.1f}%)")
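
A note on day numbering, since it is easy to mislabel the peak day: PostgreSQL's `EXTRACT(DOW ...)` numbers days from 0 (Sunday) to 6 (Saturday), while Python's `datetime.weekday()` and pandas' `dt.dayofweek` use 0 for Monday. Here is a small sketch of converting between the two conventions. The helper names are hypothetical.

```python
from datetime import date

# PostgreSQL DOW convention: 0 = Sunday ... 6 = Saturday
PG_DOW_NAMES = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
                'Thursday', 'Friday', 'Saturday']


def pg_dow_to_name(dow):
    """Map a PostgreSQL DOW value (int, float or Decimal) to a day name."""
    return PG_DOW_NAMES[int(dow)]


def python_weekday_to_pg_dow(weekday):
    """Convert Python's weekday (0 = Monday) to PostgreSQL's DOW (0 = Sunday)."""
    return (weekday + 1) % 7


sample = date(2018, 3, 11)  # a Sunday
print(sample.weekday(), python_weekday_to_pg_dow(sample.weekday()),
      pg_dow_to_name(python_weekday_to_pg_dow(sample.weekday())))
```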

## 5. Advanced Customer Segmentation using Machine Learning

Beyond RFM, let's use clustering algorithms for more sophisticated segmentation.
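
Since k-means groups points by Euclidean distance, features measured on large scales (such as total spend) can dominate features on small scales (such as review score) unless they are standardized first. The sketch below uses made-up numbers and plain NumPy z-scores, which is the transformation `StandardScaler` applies by default.

```python
import numpy as np

# Columns: total_spent (R$), avg_review_score (1-5); values are made up
X_toy = np.array([
    [100.0, 5.0],   # customer 0: low spend, happy
    [120.0, 1.0],   # customer 1: low spend, unhappy
    [900.0, 5.0],   # customer 2: high spend, happy
])


def distance(X, i, j):
    """Euclidean distance between rows i and j."""
    return float(np.linalg.norm(X[i] - X[j]))


# Raw scale: the spend column dwarfs the 4-point review gap
raw_01 = distance(X_toy, 0, 1)
raw_02 = distance(X_toy, 0, 2)

# Z-score each column (population standard deviation, ddof=0)
X_z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled_01 = distance(X_z, 0, 1)
scaled_02 = distance(X_z, 0, 2)

print(f"Raw distances:    0-1 = {raw_01:.1f}, 0-2 = {raw_02:.1f}")
print(f"Scaled distances: 0-1 = {scaled_01:.2f}, 0-2 = {scaled_02:.2f}")
```

On the raw scale the largest possible review-score gap barely registers next to the spend difference. After scaling, both differences carry similar weight.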

In [None]:
# Advanced ML-Based Customer Segmentation
print("🤖 Advanced Machine Learning Customer Segmentation")
print("=" * 55)

def prepare_segmentation_features(customer_data, lifecycle_data, rfm_data):
    """
    Prepare comprehensive feature set for ML-based segmentation
    """
    # Aggregate customer behavioral features
    behavioral_features = customer_data.groupby('customer_unique_id').agg({
        'total_order_value': ['sum', 'mean', 'std'],
        'order_id': ['nunique'],  # distinct orders; counting rows would count order items
        'category_clean': ['nunique'],
        'delivery_days': ['mean', 'std'],
        'review_score': ['mean', 'std', 'count'],
        'order_month': lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else x.iloc[0],  # Preferred month
        'order_hour': lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else x.iloc[0],   # Preferred hour
        'order_dow': lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else x.iloc[0],    # Preferred day
    }).reset_index()
    
    # Flatten column names
    behavioral_features.columns = [
        'customer_unique_id', 'total_spent', 'avg_order_value', 'order_value_std', 'total_orders',
        'categories_purchased', 'avg_delivery_days', 'delivery_std', 
        'avg_review_score', 'review_score_std', 'review_count',
        'preferred_month', 'preferred_hour', 'preferred_day'
    ]
    
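    # Fill NaN values (std is NaN for customers with a single row)
    # Caution: this also turns missing review scores into 0, below the valid 1-5 range
    behavioral_features = behavioral_features.fillna(0)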
    
    # Merge with lifecycle and RFM data
    feature_data = behavioral_features.merge(
        lifecycle_data[['customer_unique_id', 'customer_lifespan_days', 'days_since_last_order']], 
        on='customer_unique_id'
    ).merge(
        rfm_data[['customer_unique_id', 'Recency', 'Frequency', 'Monetary']], 
        on='customer_unique_id'
    )
    
    # Create additional derived features
    feature_data['order_frequency_rate'] = feature_data['total_orders'] / (feature_data['customer_lifespan_days'] + 1)
    feature_data['spending_consistency'] = 1 / (feature_data['order_value_std'] + 1)  # Higher = more consistent
    feature_data['engagement_score'] = (
        feature_data['avg_review_score'] * feature_data['review_count'] / 
        (feature_data['total_orders'] + 1)
    )
    
    return feature_data

# Prepare features
ml_features = prepare_segmentation_features(customer_df, lifecycle_data, rfm_data)

print(f"📊 ML Segmentation Feature Summary:")
print(f"   • Customers: {len(ml_features):,}")
print(f"   • Features: {ml_features.shape[1] - 1} (excluding customer ID)")

# Select features for clustering
clustering_features = [
    'total_spent', 'avg_order_value', 'total_orders', 'categories_purchased',
    'avg_delivery_days', 'avg_review_score', 'customer_lifespan_days', 
    'days_since_last_order', 'order_frequency_rate', 'spending_consistency', 
    'engagement_score'
]

# Prepare data for clustering
X = ml_features[clustering_features].fillna(0)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"\n🔧 Feature Engineering Complete:")
print(f"   • Selected {len(clustering_features)} features for clustering")
print(f"   • Features standardized for ML algorithms")

# Display feature importance/correlation with spending
feature_correlations = pd.DataFrame({
    'Feature': clustering_features,
    'Correlation_with_Spending': [X[feature].corr(X['total_spent']) for feature in clustering_features]
}).sort_values('Correlation_with_Spending', key=abs, ascending=False)

print(f"\n📈 Feature Correlations with Total Spending:")
display(feature_correlations.round(3))

In [None]:
# Optimal Cluster Analysis and K-Means Clustering
print("🎯 Optimal Cluster Analysis and K-Means Clustering")
print("=" * 55)

def find_optimal_clusters(X, max_clusters=10):
    """
    Find optimal number of clusters using elbow method and silhouette analysis
    """
    inertias = []
    silhouette_scores = []
    K_range = range(2, max_clusters + 1)
    
    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(X, kmeans.labels_))
    
    return K_range, inertias, silhouette_scores

# Find optimal clusters
k_range, inertias, sil_scores = find_optimal_clusters(X_scaled, max_clusters=8)

# Plot elbow curve and silhouette scores
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Elbow curve
ax1.plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Within-Cluster Sum of Squares (WCSS)')
ax1.set_title('Elbow Method for Optimal k', fontweight='bold')
ax1.grid(True, alpha=0.3)

# Silhouette scores
ax2.plot(k_range, sil_scores, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Number of Clusters (k)')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Analysis', fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Select optimal k
optimal_k = k_range[np.argmax(sil_scores)]
best_silhouette = max(sil_scores)

print(f"\n🎯 Optimal Clustering Results:")
print(f"   • Optimal number of clusters: {optimal_k}")
print(f"   • Best silhouette score: {best_silhouette:.3f}")

# Perform final clustering
final_kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = final_kmeans.fit_predict(X_scaled)

# Add cluster labels to our data
ml_features['ML_Cluster'] = cluster_labels

# Analyze clusters
cluster_analysis = ml_features.groupby('ML_Cluster')[clustering_features].mean().round(2)
cluster_sizes = ml_features['ML_Cluster'].value_counts().sort_index()

print(f"\n📊 ML-Based Cluster Analysis:")
print(f"\nCluster Sizes:")
for cluster, size in cluster_sizes.items():
    percentage = (size / len(ml_features)) * 100
    print(f"   Cluster {cluster}: {size:,} customers ({percentage:.1f}%)")

print(f"\n📈 Cluster Characteristics:")
display(cluster_analysis)

# Create cluster profiles
def create_cluster_profiles(data, cluster_col):
    """
    Create descriptive profiles for each cluster
    """
    profiles = {}
    
    for cluster in sorted(data[cluster_col].unique()):
        cluster_data = data[data[cluster_col] == cluster]
        
        profile = {
            'size': len(cluster_data),
            'avg_total_spent': cluster_data['total_spent'].mean(),
            'avg_order_value': cluster_data['avg_order_value'].mean(),
            'avg_orders': cluster_data['total_orders'].mean(),
            'avg_categories': cluster_data['categories_purchased'].mean(),
            'avg_review_score': cluster_data['avg_review_score'].mean(),
            'avg_lifespan': cluster_data['customer_lifespan_days'].mean(),
            'days_since_last': cluster_data['days_since_last_order'].mean()
        }
        
        profiles[f'Cluster_{cluster}'] = profile
    
    return profiles

# Create cluster profiles
cluster_profiles = create_cluster_profiles(ml_features, 'ML_Cluster')

print(f"\n🎭 Customer Cluster Profiles:")
for cluster_name, profile in cluster_profiles.items():
    print(f"\n   {cluster_name} ({profile['size']:,} customers):")
    print(f"     • Average total spent: R$ {profile['avg_total_spent']:.2f}")
    print(f"     • Average order value: R$ {profile['avg_order_value']:.2f}")
    print(f"     • Average orders: {profile['avg_orders']:.1f}")
    print(f"     • Categories explored: {profile['avg_categories']:.1f}")
    print(f"     • Average review score: {profile['avg_review_score']:.2f}")
    print(f"     • Customer lifespan: {profile['avg_lifespan']:.0f} days")
    print(f"     • Days since last order: {profile['days_since_last']:.0f}")
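
Choosing `optimal_k` purely by the highest silhouette score can favor a split that isolates a very small cluster, which is hard to act on as a marketing segment. The sketch below shows one possible guard; it is not part of the original workflow. It assumes that, for each candidate k, you kept both the silhouette score and the cluster labels. The helper names (`smallest_cluster_share`, `select_k`), the 5% floor, and the toy numbers are all hypothetical.

```python
from collections import Counter


def smallest_cluster_share(labels):
    """Return the share of points that fall in the smallest cluster."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)


def select_k(candidates, min_share=0.05):
    """Pick the k with the best silhouette among candidates whose smallest
    cluster holds at least `min_share` of the points.

    `candidates` maps k -> (silhouette_score, labels). If no candidate
    passes the size check, fall back to the plain silhouette maximum.
    """
    eligible = {
        k: score
        for k, (score, labels) in candidates.items()
        if smallest_cluster_share(labels) >= min_share
    }
    pool = eligible or {k: score for k, (score, _) in candidates.items()}
    return max(pool, key=pool.get)


# Toy example (invented numbers): k=2 scores best but puts only 2% of
# customers in its second cluster, so the guard prefers k=3.
toy_candidates = {
    2: (0.61, [0] * 98 + [1] * 2),
    3: (0.48, [0] * 50 + [1] * 30 + [2] * 20),
    4: (0.40, [0] * 40 + [1] * 30 + [2] * 20 + [3] * 10),
}
chosen_k = select_k(toy_candidates, min_share=0.05)
print(f"Selected k: {chosen_k}")
```

In the notebook, `candidates` could be filled wherever `sil_scores` is computed, by storing each fit's labels next to its score.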

## 6. Customer Personas and Business Recommendations

In this section we translate the cluster profiles above into customer personas and attach a set of strategic recommendations to each one.

In [None]:
# Customer Personas Creation and Business Strategy
print("👥 Customer Personas and Business Strategy Development")
print("=" * 60)

def create_customer_personas(ml_features, customer_data_segmented):
    """
    Create detailed customer personas based on clustering and behavioral analysis
    """
    personas = {}
    
    # Merge cluster data with original transaction data for deeper insights
    persona_data = customer_data_segmented.merge(
        ml_features[['customer_unique_id', 'ML_Cluster']], 
        on='customer_unique_id', 
        how='left'
    )
    
    for cluster in sorted(ml_features['ML_Cluster'].unique()):
        cluster_customers = ml_features[ml_features['ML_Cluster'] == cluster]
        cluster_transactions = persona_data[persona_data['ML_Cluster'] == cluster]
        
        # Calculate persona characteristics
        persona = {
            'cluster_id': cluster,
            'size': len(cluster_customers),
            'percentage': (len(cluster_customers) / len(ml_features)) * 100,
            
            # Financial characteristics
            'avg_total_spent': cluster_customers['total_spent'].mean(),
            'median_total_spent': cluster_customers['total_spent'].median(),
            'avg_order_value': cluster_customers['avg_order_value'].mean(),
            'total_revenue_contribution': cluster_customers['total_spent'].sum(),
            
            # Behavioral characteristics
            'avg_orders': cluster_customers['total_orders'].mean(),
            'avg_categories': cluster_customers['categories_purchased'].mean(),
            'avg_review_score': cluster_customers['avg_review_score'].mean(),
            'avg_engagement': cluster_customers['engagement_score'].mean(),
            
            # Lifecycle characteristics
            'avg_lifespan_days': cluster_customers['customer_lifespan_days'].mean(),
            'avg_days_since_last': cluster_customers['days_since_last_order'].mean(),
            'order_frequency_rate': cluster_customers['order_frequency_rate'].mean(),
            
            # Geographic and temporal preferences
            'top_states': cluster_transactions['customer_state'].value_counts().head(3).to_dict(),
            'top_categories': cluster_transactions['category_clean'].value_counts().head(5).to_dict(),
            'preferred_order_hour': cluster_customers['preferred_hour'].mode().iloc[0] if cluster_customers['preferred_hour'].notna().any() else 0,
            'preferred_order_day': cluster_customers['preferred_day'].mode().iloc[0] if cluster_customers['preferred_day'].notna().any() else 0,
            
            # Satisfaction metrics
            'satisfaction_distribution': cluster_transactions['satisfaction_level'].value_counts().to_dict()
        }
        
        personas[f'Cluster_{cluster}'] = persona
    
    return personas

# Create customer personas
customer_personas = create_customer_personas(ml_features, customer_data_segmented)

# Calculate total revenue for percentage calculations
total_revenue = sum([persona['total_revenue_contribution'] for persona in customer_personas.values()])

# Enhanced persona descriptions with business insights
def generate_persona_insights(personas, total_revenue):
    """
    Generate business insights and recommendations for each persona
    """
    insights = {}
    
    for persona_name, persona in personas.items():
        revenue_percentage = (persona['total_revenue_contribution'] / total_revenue) * 100
        
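        # Determine persona archetype based on characteristics.
        # Rules are checked top to bottom and the first match wins, so a cluster
        # that satisfies several conditions receives the earliest matching label.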
        if persona['avg_total_spent'] > 500 and persona['avg_orders'] > 5:
            archetype = "VIP Champions"
            description = "High-value, loyal customers who drive significant revenue"
        elif persona['avg_orders'] > 3 and persona['avg_review_score'] > 4:
            archetype = "Loyal Advocates"
            description = "Satisfied repeat customers who could become brand ambassadors"
        elif persona['avg_days_since_last'] > 120:
            archetype = "At-Risk Customers"
            description = "Previously active customers who haven't purchased recently"
        elif persona['avg_orders'] <= 1.5:
            archetype = "New/One-Time Buyers"
            description = "Customers who have made few purchases, potential for growth"
        elif persona['avg_order_value'] < 50:
            archetype = "Budget Shoppers"
            description = "Price-conscious customers who make smaller purchases"
        else:
            archetype = "Regular Customers"
            description = "Steady customers with moderate purchasing behavior"
        
        # Generate recommendations
        recommendations = []
        
        if "VIP" in archetype:
            recommendations = [
                "Offer exclusive products and early access to sales",
                "Implement premium customer service tier",
                "Create VIP loyalty program with enhanced benefits",
                "Personalized recommendations based on purchase history"
            ]
        elif "At-Risk" in archetype:
            recommendations = [
                "Send targeted win-back campaigns with special offers",
                "Implement cart abandonment recovery sequences",
                "Survey to understand reasons for decreased activity",
                "Offer time-limited discounts to encourage re-engagement"
            ]
        elif "New" in archetype or "One-Time" in archetype:
            recommendations = [
                "Create onboarding email sequences",
                "Offer first-time buyer incentives for second purchase",
                "Showcase popular products and categories",
                "Implement retargeting campaigns"
            ]
        elif "Budget" in archetype:
            recommendations = [
                "Highlight value propositions and cost savings",
                "Promote bulk buying and bundle offers",
                "Send notifications about sales and clearance items",
                "Focus on affordability in messaging"
            ]
        else:
            recommendations = [
                "Maintain consistent engagement with regular promotions",
                "Cross-sell complementary products",
                "Encourage category exploration",
                "Build loyalty through consistent experience"
            ]
        
        insights[persona_name] = {
            'archetype': archetype,
            'description': description,
            'revenue_percentage': revenue_percentage,
            'recommendations': recommendations
        }
    
    return insights

# Generate persona insights
persona_insights = generate_persona_insights(customer_personas, total_revenue)

# Display comprehensive persona analysis
print(f"\n🎭 COMPREHENSIVE CUSTOMER PERSONA ANALYSIS")
print(f"=" * 60)

for persona_name, persona in customer_personas.items():
    insight = persona_insights[persona_name]
    
    print(f"\n📊 {persona_name.upper()}: {insight['archetype']}")
    print(f"   {insight['description']}")
    print("-" * 50)
    
    print(f"   👥 Size: {persona['size']:,} customers ({persona['percentage']:.1f}% of customer base)")
    print(f"   💰 Revenue: R$ {persona['total_revenue_contribution']:,.2f} ({insight['revenue_percentage']:.1f}% of total)")
    print(f"   💵 Avg Total Spent: R$ {persona['avg_total_spent']:.2f}")
    print(f"   🛒 Avg Order Value: R$ {persona['avg_order_value']:.2f}")
    print(f"   📦 Avg Orders: {persona['avg_orders']:.1f}")
    print(f"   ⭐ Avg Review Score: {persona['avg_review_score']:.2f}")
    print(f"   📅 Days Since Last Order: {persona['avg_days_since_last']:.0f}")
    
    print(f"\n   🏷️ Top Categories:")
    for i, (category, count) in enumerate(list(persona['top_categories'].items())[:3], 1):
        print(f"     {i}. {category}: {count} purchases")
    
    print(f"\n   🎯 Strategic Recommendations:")
    for i, rec in enumerate(insight['recommendations'], 1):
        print(f"     {i}. {rec}")
    
    print("\n" + "="*60)
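
Because the archetype rules in `generate_persona_insights` form an `if/elif` chain, the first matching rule decides the label. For example, a cluster of single-purchase customers whose last order was more than 120 days ago is labeled "At-Risk Customers", not "New/One-Time Buyers". The sketch below restates the same thresholds as an ordered rule table and lists every rule a persona satisfies, which makes such overlaps visible. The names `ARCHETYPE_RULES`, `matching_archetypes`, and `classify_archetype` are hypothetical, and the toy persona values are invented for illustration.

```python
# Ordered (label, predicate) pairs; thresholds copied from the rules above.
ARCHETYPE_RULES = [
    ("VIP Champions", lambda p: p["avg_total_spent"] > 500 and p["avg_orders"] > 5),
    ("Loyal Advocates", lambda p: p["avg_orders"] > 3 and p["avg_review_score"] > 4),
    ("At-Risk Customers", lambda p: p["avg_days_since_last"] > 120),
    ("New/One-Time Buyers", lambda p: p["avg_orders"] <= 1.5),
    ("Budget Shoppers", lambda p: p["avg_order_value"] < 50),
]


def matching_archetypes(persona):
    """Return every archetype whose rule the persona satisfies, in rule order."""
    return [label for label, rule in ARCHETYPE_RULES if rule(persona)]


def classify_archetype(persona):
    """Return the first matching archetype, or the default label."""
    matches = matching_archetypes(persona)
    return matches[0] if matches else "Regular Customers"


# Invented persona: mostly one-time buyers who last ordered long ago.
toy_persona = {
    "avg_total_spent": 140.0,
    "avg_order_value": 135.0,
    "avg_orders": 1.03,
    "avg_review_score": 4.1,
    "avg_days_since_last": 240.0,
}
print("Assigned:", classify_archetype(toy_persona))
print("All matches:", matching_archetypes(toy_persona))
```

If several clusters show more than one match, it may be worth reordering the rules or tightening the thresholds before presenting the personas.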

In [None]:
# Comprehensive Business Action Plan
print("📋 COMPREHENSIVE BUSINESS ACTION PLAN")
print("=" * 50)

# Calculate key business metrics
total_customers = len(ml_features)
total_revenue = ml_features['total_spent'].sum()
avg_customer_value = total_revenue / total_customers

# Identify priority segments
revenue_by_cluster = {}
for persona_name, persona in customer_personas.items():
    revenue_by_cluster[persona_name] = persona['total_revenue_contribution']

# Sort by revenue contribution
priority_segments = sorted(revenue_by_cluster.items(), key=lambda x: x[1], reverse=True)

print(f"\n💼 EXECUTIVE SUMMARY:")
print(f"   • Total customers analyzed: {total_customers:,}")
print(f"   • Total revenue: R$ {total_revenue:,.2f}")
print(f"   • Average revenue per customer (historical): R$ {avg_customer_value:.2f}")
print(f"   • Customer segments identified: {len(customer_personas)}")

print(f"\n🏆 REVENUE PRIORITY RANKING:")
for i, (segment, revenue) in enumerate(priority_segments, 1):
    percentage = (revenue / total_revenue) * 100
    archetype = persona_insights[segment]['archetype']
    print(f"   {i}. {segment} ({archetype}): R$ {revenue:,.2f} ({percentage:.1f}%)")

# Strategic initiatives
print(f"\n🎯 STRATEGIC INITIATIVES BY PRIORITY:")

print(f"\n   1. VIP CUSTOMER RETENTION (High Priority):")
vip_segments = [s for s in persona_insights.keys() if 'VIP' in persona_insights[s]['archetype']]
if vip_segments:
    vip_revenue = sum([customer_personas[s]['total_revenue_contribution'] for s in vip_segments])
    vip_percentage = (vip_revenue / total_revenue) * 100
    print(f"     • Target: {len(vip_segments)} VIP segments contributing {vip_percentage:.1f}% of revenue")
    print(f"     • Action: Implement premium loyalty program")
    print(f"     • Expected impact: Increase VIP retention by 15-25%")
else:
    print("     • No cluster matched the VIP archetype in this run")

print(f"\n   2. AT-RISK CUSTOMER RECOVERY (High Priority):")
at_risk_segments = [s for s in persona_insights.keys() if 'At-Risk' in persona_insights[s]['archetype']]
if at_risk_segments:
    at_risk_customers = sum([customer_personas[s]['size'] for s in at_risk_segments])
    at_risk_percentage = (at_risk_customers / total_customers) * 100
    print(f"     • Target: {at_risk_customers:,} at-risk customers ({at_risk_percentage:.1f}% of base)")
    print(f"     • Action: Launch win-back campaigns")
    print(f"     • Expected impact: Recover 10-20% of at-risk customers")
else:
    print("     • No cluster matched the At-Risk archetype in this run")

print(f"\n   3. NEW CUSTOMER CONVERSION (Medium Priority):")
new_segments = [s for s in persona_insights.keys() if 'New' in persona_insights[s]['archetype'] or 'One-Time' in persona_insights[s]['archetype']]
if new_segments:
    new_customers = sum([customer_personas[s]['size'] for s in new_segments])
    new_percentage = (new_customers / total_customers) * 100
    print(f"     • Target: {new_customers:,} new/one-time customers ({new_percentage:.1f}% of base)")
    print(f"     • Action: Optimize onboarding and second purchase incentives")
    print(f"     • Expected impact: Increase conversion rate by 5-15%")
else:
    print("     • No cluster matched the New/One-Time archetype in this run")

# Key Performance Indicators (KPIs)
print(f"\n📊 RECOMMENDED KPIs FOR MONITORING:")
print(f"   • Customer Lifetime Value (CLV) by segment")
print(f"   • Churn rate by customer persona")
print(f"   • Average days between orders")
print(f"   • Cross-category purchase rate")
print(f"   • Customer satisfaction score by segment")
print(f"   • Revenue per customer segment")
print(f"   • Customer migration between segments")

# Implementation timeline
print(f"\n📅 IMPLEMENTATION TIMELINE:")
print(f"   • Week 1-2: Set up segment-based analytics tracking")
print(f"   • Week 3-4: Launch VIP customer program")
print(f"   • Week 5-6: Implement at-risk customer campaigns")
print(f"   • Week 7-8: Deploy new customer onboarding sequence")
print(f"   • Week 9-12: Monitor results and optimize campaigns")

print(f"\n✅ Customer Behavior Analysis Complete!")
print(f"   Ready for advanced product performance and time series analysis.")
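
Of the KPIs listed above, "customer migration between segments" is the one the notebook does not compute directly. A minimal sketch of how it might be tracked, assuming each customer has a segment label from two separate snapshots (for example, two quarterly clustering runs). The column names `segment_q1` and `segment_q2` and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical segment assignments for the same customers at two points in time.
snapshots = pd.DataFrame({
    "customer_unique_id": ["a", "b", "c", "d", "e", "f"],
    "segment_q1": ["Regular", "Regular", "At-Risk", "New", "New", "VIP"],
    "segment_q2": ["VIP", "Regular", "At-Risk", "Regular", "At-Risk", "VIP"],
})

# Row-normalized transition matrix: each row shows where customers
# from one starting segment ended up in the next snapshot.
migration = pd.crosstab(
    snapshots["segment_q1"],
    snapshots["segment_q2"],
    normalize="index",
)

# Share of customers that stayed in the same segment, per starting segment.
# Segments that no longer exist in the second snapshot are skipped.
stay_rate = {
    seg: migration.loc[seg, seg]
    for seg in migration.index
    if seg in migration.columns
}

print(migration.round(2))
print(stay_rate)
```

Reading the matrix row by row shows, for instance, what fraction of "New" customers moved to "Regular" versus "At-Risk" between the two snapshots.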

## Summary - Customer Behavior Analysis and Segmentation

### What We've Accomplished

1. **✅ Comprehensive Customer Data Analysis**: Loaded and analyzed 25,000+ customer transaction records
2. **✅ RFM Analysis Implementation**: Systematic customer value segmentation using Recency, Frequency, Monetary metrics
3. **✅ Customer Lifecycle Analysis**: Understanding customer journey stages and retention patterns
4. **✅ Advanced ML Segmentation**: K-means clustering with feature engineering for sophisticated customer groups
5. **✅ Customer Personas Creation**: Detailed business profiles with actionable insights
6. **✅ Strategic Business Recommendations**: Priority-based action plan for customer retention and growth

### Key Business Insights Discovered

**Customer Value Distribution:**
- Clear identification of high-value customer segments
- Revenue concentration analysis for strategic focus
- Customer lifetime value patterns

**Behavioral Patterns:**
- Purchase timing and frequency preferences
- Category exploration and loyalty indicators
- Satisfaction correlation with customer behavior

**Risk Assessment:**
- At-risk customer identification for retention campaigns
- Churn pattern recognition
- Customer lifecycle stage analysis

### Advanced Techniques Mastered

- **RFM Scoring**: Quintile-based customer value assessment
- **Machine Learning Segmentation**: K-means clustering with feature engineering
- **Customer Journey Mapping**: Lifecycle stage identification
- **Behavioral Feature Engineering**: Creating predictive customer metrics
- **Persona Development**: Translating data into actionable business profiles
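
As a refresher on the quintile-based RFM scoring listed above, here is a minimal sketch on invented data. The column names (`recency_days`, `frequency`, `monetary`) and the helper `quintile_score` are assumptions and may differ from the RFM cell earlier in the notebook. Ranking with `method="first"` before calling `pd.qcut` is one way to avoid duplicate bin edges when many customers share the same value (common for frequency); the trade-off is that tied customers are split across quintiles by row order.

```python
import pandas as pd

# Invented customer-level RFM inputs.
rfm = pd.DataFrame({
    "customer_unique_id": [f"c{i}" for i in range(10)],
    "recency_days": [5, 12, 30, 45, 60, 90, 120, 200, 300, 400],
    "frequency": [6, 4, 3, 2, 2, 1, 1, 1, 1, 1],
    "monetary": [900.0, 650.0, 400.0, 320.0, 250.0, 180.0, 120.0, 90.0, 60.0, 30.0],
})


def quintile_score(series, higher_is_better=True):
    """Map a metric to scores 1-5 by quintile, with 5 as the best score."""
    # Rank first so that tied values still produce five distinct bins.
    ranked = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranked, q=5, labels=[1, 2, 3, 4, 5]).astype(int)


# Recent buyers (low recency_days) should score high, so invert the ranking.
rfm["R"] = quintile_score(rfm["recency_days"], higher_is_better=False)
rfm["F"] = quintile_score(rfm["frequency"])
rfm["M"] = quintile_score(rfm["monetary"])

# Concatenated score such as "555" for the best customers.
rfm["RFM_Score"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)

print(rfm[["customer_unique_id", "R", "F", "M", "RFM_Score"]])
```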

### Next Steps
In Part 2, we'll explore:
- Product performance metrics and category analysis
- Cross-selling and upselling opportunity identification
- Product lifecycle and performance optimization strategies

## 🎯 Practice Exercises - Customer Behavior Analysis

Strengthen your customer analytics skills:

1. **Custom RFM Analysis**: Modify the RFM scoring system to use different quantiles or weighted scores

2. **Geographic Segmentation**: Analyze customer behavior differences by state/region

3. **Seasonal Behavior Analysis**: Identify customer segments with different seasonal purchasing patterns

4. **Churn Prediction Model**: Create a simple model to predict customer churn risk

5. **Customer Journey Optimization**: Propose improvements to customer lifecycle progression
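
For Exercise 4, a common first step is defining the churn label without leaking future information into the features. The sketch below is one possible starting point, not a full model: features are built only from orders before a cutoff date, and a customer counts as churned if they place no order in the following window. The cutoff, the 90-day horizon, and the toy orders are assumptions made for illustration; the column names follow the Olist schema.

```python
import pandas as pd

# Invented order history for four customers.
orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b", "c", "c", "d"],
    "order_purchase_timestamp": pd.to_datetime([
        "2018-01-10", "2018-05-02", "2018-02-15",
        "2018-03-01", "2018-03-20", "2018-06-01",
    ]),
    "payment_value": [120.0, 80.0, 45.0, 60.0, 30.0, 200.0],
})

cutoff = pd.Timestamp("2018-04-01")      # features use data before this date
horizon = pd.Timedelta(days=90)          # label window after the cutoff

ts = orders["order_purchase_timestamp"]
history = orders[ts < cutoff]
future = orders[(ts >= cutoff) & (ts < cutoff + horizon)]

# Features computed from pre-cutoff orders only.
features = history.groupby("customer_unique_id").agg(
    n_orders=("order_purchase_timestamp", "count"),
    total_spent=("payment_value", "sum"),
    last_order=("order_purchase_timestamp", "max"),
)
features["recency_days"] = (cutoff - features["last_order"]).dt.days

# Label: 1 if the customer did not order again inside the label window.
returned_ids = future["customer_unique_id"].unique()
features["churned"] = (~features.index.isin(returned_ids)).astype(int)

print(features[["n_orders", "total_spent", "recency_days", "churned"]])
```

Customer `d` does not appear in `features` because they had no orders before the cutoff. The resulting table can then be passed to any classifier of your choice.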

In [None]:
# Exercise Space - Customer Behavior Analysis
# Use this space to practice the customer analytics techniques

# Exercise 1: Custom RFM Analysis
# Modify the RFM scoring system

# Exercise 2: Geographic Segmentation
# Analyze behavior differences by location

# Exercise 3: Seasonal Behavior Analysis
# Identify seasonal purchasing patterns

# Exercise 4: Churn Prediction Model
# Create a simple churn risk model

# Exercise 5: Customer Journey Optimization
# Propose lifecycle improvements