# Advanced Matplotlib and Seaborn - Part 1: Seaborn for Statistical Visualization

**Week 5 Thursday - May 8, 2025**

## Learning Objectives
By the end of this session, you will be able to:
1. Create statistical visualizations using Seaborn's powerful plotting functions
2. Analyze data distributions and relationships for business insights
3. Build professional statistical dashboards for e-commerce analytics
4. Translate SQL analytical queries into visual Python insights

## Why Seaborn for Business Analytics?

**Statistical Foundation:**
- **Built-in Statistics**: Automatic calculation of confidence intervals, regression lines, and distributions
- **Business KPIs**: Perfect for customer analytics, sales performance, and market research
- **Professional Styling**: Publication-ready plots with minimal code
- **Data-Driven Insights**: Statistical annotations help guide business decisions

**Real-World Applications:**
- **Customer Segmentation**: Distribution analysis and clustering visualization
- **Sales Performance**: Trend analysis with statistical confidence intervals
- **Market Research**: Correlation analysis and comparative studies
- **Executive Reporting**: Professional statistical summaries for stakeholders

## Setup and Data Preparation

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime, timedelta
from scipy import stats
import zipfile
import requests
from io import BytesIO

# Configure display and plotting settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
warnings.filterwarnings('ignore')

# Set seaborn style and color palette
sns.set_style("whitegrid")
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")
print("📊 Seaborn configured for statistical visualization")
print("🎨 Ready to explore data distributions and relationships")

In [None]:
# Generate a synthetic, Olist-style sample dataset for statistical analysis
def create_statistical_dataset():
    """
    Create a comprehensive e-commerce dataset for statistical visualization
    Simulates real Olist Brazilian marketplace data patterns
    """
    print("📊 Creating comprehensive e-commerce dataset for statistical analysis...")
    
    np.random.seed(42)
    
    # Dataset parameters
    n_customers = 3000
    n_orders = 8000
    n_products = 500
    
    # Brazilian states with illustrative sampling weights (must sum to 1 for np.random.choice)
    states = {
        'SP': 0.35, 'RJ': 0.12, 'MG': 0.10, 'RS': 0.08, 'PR': 0.07,
        'SC': 0.06, 'BA': 0.05, 'GO': 0.04, 'PE': 0.04, 'CE': 0.09
    }
    
    # Product categories with market share
    categories = {
        'electronics': 0.18, 'home_garden': 0.15, 'sports_leisure': 0.12,
        'health_beauty': 0.11, 'fashion_bags': 0.10, 'computers': 0.08,
        'auto': 0.07, 'toys': 0.06, 'furniture': 0.08, 'books': 0.05
    }
    
    # Payment methods with usage distribution
    payment_methods = {
        'credit_card': 0.73, 'boleto': 0.19, 'debit_card': 0.06, 'voucher': 0.02
    }
    
    # Generate customer base
    customers = []
    for i in range(n_customers):
        customer = {
            'customer_id': f'customer_{i:04d}',
            'customer_state': np.random.choice(list(states.keys()), p=list(states.values())),
            'customer_lifetime_days': np.random.gamma(2, 180),  # Customer tenure
            'customer_segment': np.random.choice(['Premium', 'Standard', 'Budget'], p=[0.2, 0.5, 0.3])
        }
        customers.append(customer)
    
    customers_df = pd.DataFrame(customers)
    
    # Generate orders with realistic patterns
    orders = []
    for i in range(n_orders):
        # Select customer and get their characteristics
        customer = customers_df.sample(1).iloc[0]
        
        # Generate order date with seasonality
        # Use pd.Timestamp rather than datetime: .to_period() and .quarter below are pandas-only
        base_date = pd.Timestamp(2023, 1, 1)
        days_offset = np.random.randint(0, 730)  # 2 years of data
        order_date = base_date + timedelta(days=days_offset)
        
        # Seasonal multiplier (November-December holiday season boost)
        seasonal_boost = 1.4 if order_date.month in [11, 12] else 1.0
        
        # Customer segment effects
        segment_multiplier = {
            'Premium': 2.5, 'Standard': 1.0, 'Budget': 0.6
        }[customer['customer_segment']]
        
        # State economic effects
        state_multiplier = {
            'SP': 1.3, 'RJ': 1.2, 'MG': 1.0, 'RS': 1.1, 'PR': 1.0,
            'SC': 1.1, 'BA': 0.8, 'GO': 0.9, 'PE': 0.7, 'CE': 0.7
        }[customer['customer_state']]
        
        # Generate order value with log-normal distribution
        base_value = np.random.lognormal(3.5, 0.8)
        order_value = base_value * seasonal_boost * segment_multiplier * state_multiplier
        
        # Select category with realistic probabilities
        category = np.random.choice(list(categories.keys()), p=list(categories.values()))
        
        # Category-specific adjustments
        category_adjustments = {
            'electronics': 1.5, 'computers': 2.0, 'furniture': 1.8,
            'auto': 1.3, 'books': 0.4, 'toys': 0.6
        }
        order_value *= category_adjustments.get(category, 1.0)
        
        # Generate other order characteristics
        order = {
            'order_id': f'order_{i:06d}',
            'customer_id': customer['customer_id'],
            'customer_state': customer['customer_state'],
            'customer_segment': customer['customer_segment'],
            'order_date': order_date,
            'order_month': order_date.to_period('M'),
            'order_quarter': f"Q{order_date.quarter} {order_date.year}",
            'order_value': round(order_value, 2),
            'product_category': category,
            'payment_type': np.random.choice(list(payment_methods.keys()), p=list(payment_methods.values())),
            'freight_value': np.random.gamma(2, 8) + 5,  # Shipping cost
            'item_count': np.random.poisson(2) + 1,
            'delivery_days': np.random.gamma(3, 4) + 1,
            'review_score': np.random.choice([1, 2, 3, 4, 5], p=[0.04, 0.06, 0.15, 0.30, 0.45])
        }
        
        # Delivery performance affects review scores
        if order['delivery_days'] > 20:
            order['review_score'] = max(1, order['review_score'] - np.random.randint(1, 3))
        elif order['delivery_days'] < 7:
            order['review_score'] = min(5, order['review_score'] + np.random.randint(0, 2))
        
        orders.append(order)
    
    orders_df = pd.DataFrame(orders)
    
    # Add derived business metrics
    orders_df['freight_ratio'] = orders_df['freight_value'] / orders_df['order_value']
    orders_df['total_value'] = orders_df['order_value'] + orders_df['freight_value']
    orders_df['value_per_item'] = orders_df['order_value'] / orders_df['item_count']
    
    # Create value tiers for segmentation
    orders_df['value_tier'] = pd.cut(orders_df['order_value'], 
                                   bins=[0, 50, 150, 400, float('inf')],
                                   labels=['Low', 'Medium', 'High', 'Premium'])
    
    # Satisfaction levels
    orders_df['satisfaction'] = pd.cut(orders_df['review_score'],
                                     bins=[0, 2, 3, 4, 5],
                                     labels=['Poor', 'Fair', 'Good', 'Excellent'])
    
    # Delivery performance tiers
    orders_df['delivery_performance'] = pd.cut(orders_df['delivery_days'],
                                             bins=[0, 7, 14, 21, float('inf')],
                                             labels=['Fast', 'Standard', 'Slow', 'Very Slow'])
    
    return orders_df

# Create the comprehensive dataset
ecommerce_data = create_statistical_dataset()

print(f"✅ Dataset created successfully!")
print(f"📦 Total orders: {len(ecommerce_data):,}")
print(f"👥 Unique customers: {ecommerce_data['customer_id'].nunique():,}")
print(f"🛍️ Product categories: {ecommerce_data['product_category'].nunique()}")
print(f"🌎 States covered: {ecommerce_data['customer_state'].nunique()}")
print(f"💰 Order value range: R$ {ecommerce_data['order_value'].min():.2f} - R$ {ecommerce_data['order_value'].max():.2f}")
print(f"📅 Date range: {ecommerce_data['order_date'].min().date()} to {ecommerce_data['order_date'].max().date()}")

# Display sample data
print("\n📊 Sample data preview:")
display(ecommerce_data.head())

## 1. Distribution Analysis with Seaborn (15 minutes)

### Understanding Data Distributions

**Business Value:**
- **Customer Insights**: Understanding spending patterns and behavior distributions
- **Pricing Strategy**: Analyzing order value distributions to optimize pricing tiers
- **Market Segmentation**: Identifying natural customer segments through distribution analysis
- **Performance Metrics**: Evaluating delivery times, satisfaction scores, and operational KPIs

**Key Seaborn Functions:**
- `histplot()`: Histograms with automatic binning and statistical overlays
- `kdeplot()`: Kernel density estimation for smooth distribution curves
- `boxplot()` and `violinplot()`: Distribution summaries with quartiles and shape
- `displot()`: Figure-level distribution plots with automatic faceting

In [None]:
# 1.1 Order Value Distribution Analysis

# Create comprehensive distribution analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('E-commerce Order Value Distribution Analysis\nUnderstanding Customer Spending Patterns', 
             fontsize=16, fontweight='bold', y=0.98)

# Histogram with KDE overlay
sns.histplot(data=ecommerce_data, x='order_value', kde=True, ax=axes[0,0])
axes[0,0].axvline(ecommerce_data['order_value'].mean(), color='red', linestyle='--', 
                  label=f'Mean: R$ {ecommerce_data["order_value"].mean():.2f}')
axes[0,0].axvline(ecommerce_data['order_value'].median(), color='orange', linestyle='--', 
                  label=f'Median: R$ {ecommerce_data["order_value"].median():.2f}')
axes[0,0].set_title('Order Value Distribution\nHistogram with Statistical Overlays')
axes[0,0].set_xlabel('Order Value (R$)')
axes[0,0].legend()

# Log-scale distribution (common for financial data)
sns.histplot(data=ecommerce_data, x='order_value', kde=True, log_scale=True, ax=axes[0,1])
axes[0,1].set_title('Order Value Distribution (Log Scale)\nRevealing Underlying Patterns')
axes[0,1].set_xlabel('Order Value (R$) - Log Scale')

# Box plot by customer segment
sns.boxplot(data=ecommerce_data, x='customer_segment', y='order_value', ax=axes[1,0])
axes[1,0].set_title('Order Value by Customer Segment\nIdentifying Spending Patterns')
axes[1,0].set_ylabel('Order Value (R$)')

# Violin plot showing distribution shape by segment
sns.violinplot(data=ecommerce_data, x='customer_segment', y='order_value', ax=axes[1,1])
axes[1,1].set_title('Order Value Distribution Shapes\nDetailed Segment Analysis')
axes[1,1].set_ylabel('Order Value (R$)')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Statistical summary
print("📊 ORDER VALUE DISTRIBUTION INSIGHTS:")
print("=" * 50)

stats_summary = ecommerce_data['order_value'].describe()
print(f"Mean: R$ {stats_summary['mean']:.2f}")
print(f"Median: R$ {stats_summary['50%']:.2f}")
print(f"Standard Deviation: R$ {stats_summary['std']:.2f}")
print(f"Skewness: {ecommerce_data['order_value'].skew():.2f} (Right-skewed typical for financial data)")

print("\n💼 BUSINESS INSIGHTS:")
print(f"• Median < Mean indicates right-skewed distribution (few high-value orders)")
print(f"• 75% of orders are below R$ {stats_summary['75%']:.2f}")
print(f"• Premium customers show significantly higher order values")
print(f"• On a log scale the distribution looks approximately bell-shaped, consistent with log-normal order values")

In [None]:
# 1.2 Delivery Performance Distribution Analysis

# Create multi-faceted delivery analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Delivery Performance Analysis\nOperational Excellence Metrics', 
             fontsize=16, fontweight='bold', y=0.98)

# Delivery days distribution
sns.histplot(data=ecommerce_data, x='delivery_days', bins=30, kde=True, ax=axes[0,0])
axes[0,0].axvline(ecommerce_data['delivery_days'].mean(), color='red', linestyle='--', 
                  label=f'Mean: {ecommerce_data["delivery_days"].mean():.1f} days')
axes[0,0].axvline(7, color='green', linestyle='--', alpha=0.7, label='Target: 7 days')
axes[0,0].set_title('Delivery Time Distribution\nVs. Performance Target')
axes[0,0].set_xlabel('Delivery Days')
axes[0,0].legend()

# Delivery performance by state
top_states = ecommerce_data['customer_state'].value_counts().head(6).index
state_data = ecommerce_data[ecommerce_data['customer_state'].isin(top_states)]
sns.boxplot(data=state_data, x='customer_state', y='delivery_days', ax=axes[0,1])
axes[0,1].set_title('Delivery Performance by State\nRegional Logistics Analysis')
axes[0,1].set_ylabel('Delivery Days')
axes[0,1].tick_params(axis='x', rotation=45)

# Review score distribution
review_counts = ecommerce_data['review_score'].value_counts().sort_index()
sns.barplot(x=review_counts.index, y=review_counts.values, ax=axes[1,0])
axes[1,0].set_title('Customer Review Score Distribution\nSatisfaction Metrics')
axes[1,0].set_xlabel('Review Score (1-5 stars)')
axes[1,0].set_ylabel('Number of Orders')

# Relationship: Delivery time vs Review score
sns.boxplot(data=ecommerce_data, x='review_score', y='delivery_days', ax=axes[1,1])
axes[1,1].set_title('Delivery Time Impact on Satisfaction\nOperational Quality Connection')
axes[1,1].set_xlabel('Review Score')
axes[1,1].set_ylabel('Delivery Days')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Performance metrics
print("🚚 DELIVERY PERFORMANCE INSIGHTS:")
print("=" * 50)

fast_delivery = (ecommerce_data['delivery_days'] <= 7).mean() * 100
avg_rating = ecommerce_data['review_score'].mean()
high_satisfaction = (ecommerce_data['review_score'] >= 4).mean() * 100

print(f"Average delivery time: {ecommerce_data['delivery_days'].mean():.1f} days")
print(f"Fast delivery rate (≤7 days): {fast_delivery:.1f}%")
print(f"Average customer rating: {avg_rating:.2f}/5.0")
print(f"High satisfaction rate (≥4 stars): {high_satisfaction:.1f}%")

# Correlation analysis
correlation = ecommerce_data['delivery_days'].corr(ecommerce_data['review_score'])
print(f"\n📈 Delivery-Satisfaction Correlation: {correlation:.3f}")
print(f"Interpretation: {'Strong negative' if correlation < -0.5 else 'Moderate negative' if correlation < -0.3 else 'Weak'} relationship")

## 2. Relationship Analysis and Correlation (15 minutes)

### Statistical Relationships in Business Data

**Business Applications:**
- **Price Optimization**: Understanding value-volume relationships
- **Customer Behavior**: Correlating satisfaction with business metrics
- **Market Analysis**: Category performance and seasonal patterns
- **Operational Efficiency**: Delivery performance impact on satisfaction

**Key Seaborn Functions:**
- `scatterplot()`: Exploring relationships between continuous variables
- `regplot()`: Adding regression lines with confidence intervals
- `heatmap()`: Correlation matrices for multiple variables
- `pairplot()`: Comprehensive relationship exploration
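
The sketch below illustrates the `heatmap()` workflow on made-up data: compute a correlation matrix, hide the redundant upper triangle, and rank the remaining pairs numerically. The `toy` columns `x`, `y`, and `z` are hypothetical, and `y` is deliberately constructed to track `x` so the strongest pair is known in advance.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: y depends on x by construction, z is independent noise
rng = np.random.default_rng(2)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation by default

# Mask the upper triangle (and diagonal) so each pair appears once
mask = np.triu(np.ones_like(corr.values, dtype=bool))
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            cmap="RdBu_r", center=0, vmin=-1, vmax=1, ax=ax)

# Rank the same lower-triangle pairs numerically
rows, cols = np.tril_indices_from(corr.values, k=-1)
pairs = pd.DataFrame({
    "var1": corr.index[rows],
    "var2": corr.columns[cols],
    "r": corr.values[rows, cols],
})
strongest = pairs.loc[pairs["r"].abs().idxmax()]
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale comparable across different heatmaps; that is a design choice in this sketch, not a requirement.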

In [None]:
# 2.1 Business Metrics Correlation Analysis

# Select key business metrics for correlation analysis
business_metrics = ecommerce_data[[
    'order_value', 'freight_value', 'item_count', 'delivery_days', 
    'review_score', 'freight_ratio', 'value_per_item'
]].copy()

# Calculate correlation matrix
correlation_matrix = business_metrics.corr()

# Create correlation heatmap
fig, ax = plt.subplots(figsize=(10, 8))
fig.suptitle('Business Metrics Correlation Analysis\nIdentifying Key Relationships for Strategic Decisions', 
             fontsize=16, fontweight='bold', y=1.02)

mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Hide upper triangle
sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            square=True, ax=ax, cbar_kws={"shrink": .8})
ax.set_title('Correlation Matrix\nBusiness Metrics Relationships', fontweight='bold')

plt.tight_layout()
plt.show()

# Clustermap for hierarchical relationships.
# clustermap() is figure-level: it always creates its own figure and cannot be
# drawn into an existing subplot, so it is shown separately from the heatmap.
g = sns.clustermap(correlation_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0, 
                   figsize=(8, 6), cbar_pos=(0.02, 0.83, 0.03, 0.15))
g.fig.suptitle('Hierarchical Clustering\nMetric Relationships', fontweight='bold', y=1.05)
plt.show()

print("🔍 CORRELATION INSIGHTS:")
print("=" * 50)

# Find strongest correlations
correlation_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        correlation_pairs.append({
            'var1': correlation_matrix.columns[i],
            'var2': correlation_matrix.columns[j],
            'correlation': corr_value
        })

correlation_df = pd.DataFrame(correlation_pairs)
correlation_df['abs_correlation'] = correlation_df['correlation'].abs()
top_correlations = correlation_df.nlargest(5, 'abs_correlation')

print("\n🔗 STRONGEST RELATIONSHIPS:")
for _, row in top_correlations.iterrows():
    direction = "Positive" if row['correlation'] > 0 else "Negative"
    strength = "Strong" if abs(row['correlation']) > 0.7 else "Moderate" if abs(row['correlation']) > 0.4 else "Weak"
    print(f"• {row['var1']} ↔ {row['var2']}: {row['correlation']:.3f} ({strength} {direction.lower()})")

print("\n💼 BUSINESS IMPLICATIONS:")
print("• Strong positive correlations suggest complementary metrics")
print("• Negative correlations indicate trade-offs or operational impacts")
print("• Use these insights for pricing strategy and operational optimization")

In [None]:
# 2.2 Advanced Relationship Visualization

# Create detailed scatter plot analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Advanced Relationship Analysis\nStatistical Insights for Business Strategy', 
             fontsize=16, fontweight='bold', y=0.98)

# Order value vs review score with regression
sns.regplot(data=ecommerce_data, x='order_value', y='review_score', ax=axes[0,0],
            scatter_kws={'alpha':0.6}, line_kws={'color':'red'})
axes[0,0].set_title('Order Value vs Customer Satisfaction\nDoes Spending Correlate with Happiness?')
axes[0,0].set_xlabel('Order Value (R$)')
axes[0,0].set_ylabel('Review Score')

# Delivery days vs review score
sns.regplot(data=ecommerce_data, x='delivery_days', y='review_score', ax=axes[0,1],
            scatter_kws={'alpha':0.6}, line_kws={'color':'red'})
axes[0,1].set_title('Delivery Speed vs Satisfaction\nOperational Impact on Customer Experience')
axes[0,1].set_xlabel('Delivery Days')
axes[0,1].set_ylabel('Review Score')

# Freight ratio vs order value (colored by segment)
sns.scatterplot(data=ecommerce_data, x='order_value', y='freight_ratio', 
                hue='customer_segment', ax=axes[1,0], alpha=0.7)
axes[1,0].set_title('Shipping Cost Efficiency\nFreight Ratio by Order Value and Segment')
axes[1,0].set_xlabel('Order Value (R$)')
axes[1,0].set_ylabel('Freight Ratio (Freight/Order Value)')
axes[1,0].legend(title='Customer Segment')

# Item count vs order value with category distinction
# Sample data for readability (fixed random_state keeps the plot reproducible)
sample_data = ecommerce_data.sample(1000, random_state=42)
sns.scatterplot(data=sample_data, x='item_count', y='order_value', 
                hue='product_category', ax=axes[1,1], alpha=0.7)
axes[1,1].set_title('Basket Size Analysis\nItems per Order vs Total Value')
axes[1,1].set_xlabel('Number of Items')
axes[1,1].set_ylabel('Order Value (R$)')
axes[1,1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Statistical significance testing
print("📈 STATISTICAL RELATIONSHIP ANALYSIS:")
print("=" * 50)

# Pearson correlations with significance
from scipy.stats import pearsonr

relationships = [
    ('order_value', 'review_score', 'Order Value ↔ Satisfaction'),
    ('delivery_days', 'review_score', 'Delivery Speed ↔ Satisfaction'),
    ('freight_ratio', 'review_score', 'Shipping Efficiency ↔ Satisfaction'),
    ('item_count', 'order_value', 'Basket Size ↔ Order Value')
]

for var1, var2, description in relationships:
    correlation, p_value = pearsonr(ecommerce_data[var1], ecommerce_data[var2])
    significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else "ns"
    print(f"{description}: r = {correlation:.3f} {significance}")
    
print("\n📊 SIGNIFICANCE LEVELS: *** p<0.001, ** p<0.01, * p<0.05, ns = not significant")

print("\n💡 BUSINESS STRATEGY INSIGHTS:")
print("• A significant negative delivery-satisfaction correlation would argue for prioritizing logistics")
print("• Freight ratio optimization opportunities for customer segments")
print("• Category-specific basket size patterns inform cross-selling strategy")

## 3. Categorical Data Analysis (10 minutes)

### Business Category Performance Analysis

**Strategic Applications:**
- **Market Share Analysis**: Understanding category performance and trends
- **Customer Segmentation**: Behavior patterns across different groups
- **Regional Strategy**: Geographic performance variations
- **Seasonal Planning**: Category-specific seasonal patterns

**Key Seaborn Functions:**
- `countplot()`: Category frequency analysis
- `barplot()`: Aggregated metrics by category
- `catplot()`: Multi-faceted categorical analysis
- `pointplot()`: Trend analysis across categories
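
To make the difference between `countplot()` and `barplot()` concrete, the sketch below plots both on a tiny made-up table and reads the bar heights back from the axes. `countplot()` shows how many rows fall in each category, while `barplot()` shows an aggregate of a numeric column (the mean by default). The `toy` data and the `order` list are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: seven orders across three categories
toy = pd.DataFrame({
    "category": ["books", "toys", "books", "auto", "toys", "books", "auto"],
    "value": [20.0, 35.0, 30.0, 120.0, 45.0, 40.0, 100.0],
})
order = ["auto", "books", "toys"]  # explicit order so bars are easy to compare

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# countplot: bar height = number of rows per category
sns.countplot(data=toy, x="category", order=order, ax=axes[0])

# barplot: bar height = mean of `value` per category
sns.barplot(data=toy, x="category", y="value", order=order, ax=axes[1])

count_heights = [p.get_height() for p in axes[0].patches]
mean_heights = [p.get_height() for p in axes[1].patches]

# The same numbers computed directly with pandas
expected_counts = toy["category"].value_counts().reindex(order).tolist()
expected_means = toy.groupby("category")["value"].mean().reindex(order).tolist()
```

Comparing the plotted heights against the pandas results is a quick sanity check that a chart shows the aggregate you intended.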

In [None]:
# 3.1 Product Category Performance Analysis

# Create comprehensive category analysis
fig, axes = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle('Product Category Performance Analysis\nMarket Share and Profitability Insights', 
             fontsize=16, fontweight='bold', y=0.98)

# Category market share (order count)
category_counts = ecommerce_data['product_category'].value_counts()
sns.barplot(x=category_counts.values, y=category_counts.index, ax=axes[0,0])
axes[0,0].set_title('Market Share by Order Volume\nCategory Popularity Ranking')
axes[0,0].set_xlabel('Number of Orders')
axes[0,0].set_ylabel('Product Category')

# Average order value by category
category_avg_value = ecommerce_data.groupby('product_category')['order_value'].mean().sort_values(ascending=True)
sns.barplot(x=category_avg_value.values, y=category_avg_value.index, ax=axes[0,1], 
            palette='viridis')
axes[0,1].set_title('Average Order Value by Category\nRevenue Quality Metrics')
axes[0,1].set_xlabel('Average Order Value (R$)')
axes[0,1].set_ylabel('Product Category')

# Customer satisfaction by category
sns.boxplot(data=ecommerce_data, y='product_category', x='review_score', ax=axes[1,0])
axes[1,0].set_title('Customer Satisfaction by Category\nQuality Perception Analysis')
axes[1,0].set_xlabel('Review Score')
axes[1,0].set_ylabel('Product Category')

# Category performance matrix (value vs satisfaction)
category_summary = ecommerce_data.groupby('product_category').agg({
    'order_value': 'mean',
    'review_score': 'mean',
    'order_id': 'count'
}).reset_index()
category_summary.columns = ['category', 'avg_value', 'avg_satisfaction', 'order_count']

sns.scatterplot(data=category_summary, x='avg_value', y='avg_satisfaction', 
                size='order_count', sizes=(100, 1000), alpha=0.7, ax=axes[1,1])
axes[1,1].set_title('Category Performance Matrix\nValue vs Satisfaction (Size = Volume)')
axes[1,1].set_xlabel('Average Order Value (R$)')
axes[1,1].set_ylabel('Average Satisfaction Score')

# Add category labels to scatter plot
for idx, row in category_summary.iterrows():
    axes[1,1].annotate(row['category'][:8], (row['avg_value'], row['avg_satisfaction']), 
                      xytext=(5, 5), textcoords='offset points', fontsize=8, alpha=0.8)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Category insights summary
print("🛍️ CATEGORY PERFORMANCE INSIGHTS:")
print("=" * 50)

print("\n📊 TOP PERFORMING CATEGORIES:")
print("\nBy Volume (Orders):")
for i, (category, count) in enumerate(category_counts.head(3).items(), 1):
    print(f"{i}. {category}: {count:,} orders ({count/len(ecommerce_data)*100:.1f}%)")

print("\nBy Average Value:")
# category_avg_value is sorted ascending, so reverse the tail to rank the highest first
for i, (category, value) in enumerate(category_avg_value.tail(3)[::-1].items(), 1):
    print(f"{i}. {category}: R$ {value:.2f}")

print("\nBy Customer Satisfaction:")
category_satisfaction = ecommerce_data.groupby('product_category')['review_score'].mean().sort_values(ascending=False)
for i, (category, rating) in enumerate(category_satisfaction.head(3).items(), 1):
    print(f"{i}. {category}: {rating:.2f}/5.0")

# Strategic quadrants
print("\n🎯 STRATEGIC CATEGORY POSITIONING:")
high_value_threshold = category_summary['avg_value'].median()
high_satisfaction_threshold = category_summary['avg_satisfaction'].median()

stars = category_summary[(category_summary['avg_value'] > high_value_threshold) & 
                        (category_summary['avg_satisfaction'] > high_satisfaction_threshold)]
print(f"⭐ Star Categories (High Value + High Satisfaction): {', '.join(stars['category'].tolist())}")

cash_cows = category_summary[(category_summary['avg_value'] > high_value_threshold) & 
                            (category_summary['avg_satisfaction'] <= high_satisfaction_threshold)]
print(f"💰 Cash Cow Categories (High Value + Lower Satisfaction): {', '.join(cash_cows['category'].tolist())}")

In [None]:
# 3.2 Customer Segment and Geographic Analysis

# Create customer segment performance analysis
fig, axes = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle('Customer Segmentation and Geographic Analysis\nTargeted Strategy Insights', 
             fontsize=16, fontweight='bold', y=0.98)

# Customer segment order value distribution
sns.violinplot(data=ecommerce_data, x='customer_segment', y='order_value', ax=axes[0,0])
axes[0,0].set_title('Order Value by Customer Segment\nSpending Pattern Analysis')
axes[0,0].set_ylabel('Order Value (R$)')
axes[0,0].set_xlabel('Customer Segment')

# Geographic performance (top states)
top_states = ecommerce_data['customer_state'].value_counts().head(8).index
state_data = ecommerce_data[ecommerce_data['customer_state'].isin(top_states)]

state_summary = state_data.groupby('customer_state').agg({
    'order_value': 'mean',
    'review_score': 'mean',
    'order_id': 'count'
}).reset_index()

sns.barplot(data=state_summary, x='customer_state', y='order_value', ax=axes[0,1])
axes[0,1].set_title('Average Order Value by State\nRegional Economic Patterns')
axes[0,1].set_ylabel('Average Order Value (R$)')
axes[0,1].set_xlabel('State')
axes[0,1].tick_params(axis='x', rotation=45)

# Payment method preferences by segment
payment_segment = pd.crosstab(ecommerce_data['customer_segment'], 
                             ecommerce_data['payment_type'], normalize='index')
payment_segment.plot(kind='bar', stacked=True, ax=axes[1,0])
axes[1,0].set_title('Payment Method Preferences\nby Customer Segment')
axes[1,0].set_ylabel('Proportion')
axes[1,0].set_xlabel('Customer Segment')
axes[1,0].legend(title='Payment Type', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1,0].tick_params(axis='x', rotation=0)

# Satisfaction across segments and value tiers
segment_satisfaction = ecommerce_data.groupby(['customer_segment', 'value_tier'])['review_score'].mean().unstack()
sns.heatmap(segment_satisfaction, annot=True, cmap='RdYlBu_r', ax=axes[1,1], 
            cbar_kws={'label': 'Average Review Score'})
axes[1,1].set_title('Satisfaction Matrix\nSegment vs Value Tier')
axes[1,1].set_xlabel('Value Tier')
axes[1,1].set_ylabel('Customer Segment')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

print("👥 CUSTOMER SEGMENTATION INSIGHTS:")
print("=" * 50)

# Segment analysis
segment_stats = ecommerce_data.groupby('customer_segment').agg({
    'order_value': ['mean', 'median', 'count'],
    'review_score': 'mean',
    'delivery_days': 'mean'
}).round(2)

print("\n📊 SEGMENT PERFORMANCE SUMMARY:")
for segment in ['Premium', 'Standard', 'Budget']:
    if segment in segment_stats.index:
        seg = segment_stats.loc[segment]  # named `seg` to avoid shadowing the imported scipy `stats` module
        print(f"\n{segment} Customers:")
        print(f"  • Average Order: R$ {seg[('order_value', 'mean')]:.2f}")
        print(f"  • Order Count: {int(seg[('order_value', 'count')]):,}")  # cast: the row is float after .round()
        print(f"  • Satisfaction: {seg[('review_score', 'mean')]:.2f}/5.0")
        print(f"  • Avg Delivery: {seg[('delivery_days', 'mean')]:.1f} days")

print("\n🌎 GEOGRAPHIC INSIGHTS:")
print(f"• São Paulo (SP) leads in volume: {(ecommerce_data['customer_state']=='SP').sum():,} orders")
print(f"• Highest AOV state: {state_summary.loc[state_summary['order_value'].idxmax(), 'customer_state']} "
      f"(R$ {state_summary['order_value'].max():.2f})")
print(f"• Most satisfied state: {state_summary.loc[state_summary['review_score'].idxmax(), 'customer_state']} "
      f"({state_summary['review_score'].max():.2f}/5.0)")

print("\n💳 PAYMENT INSIGHTS:")
payment_overall = ecommerce_data['payment_type'].value_counts(normalize=True)
print(f"• Credit card dominance: {payment_overall['credit_card']*100:.1f}% of transactions")
print(f"• Card payment share (credit + debit): {(payment_overall['credit_card'] + payment_overall['debit_card'])*100:.1f}%")

## 4. Figure-Level vs Axes-Level Functions (5 minutes)

### Understanding Seaborn's Architecture

**Key Concepts:**
- **Figure-level functions**: `displot()`, `catplot()`, `relplot()` - Create complete figures with automatic faceting
- **Axes-level functions**: `histplot()`, `boxplot()`, `scatterplot()` - Plot on existing axes for custom layouts
- **When to use each**: Figure-level for exploration, axes-level for custom dashboards

**Business Value:**
- **Rapid Exploration**: Figure-level functions for quick insights
- **Custom Dashboards**: Axes-level for precise business reporting layouts
- **Presentation Ready**: Choose the right approach for your audience

In [None]:
# 4.1 Figure-level Functions for Exploration

print("🔍 FIGURE-LEVEL FUNCTIONS: Powerful Exploration Tools")
print("=" * 60)

# Figure-level: Automatic faceting by category
g1 = sns.displot(data=ecommerce_data, x='order_value', hue='customer_segment', 
                 col='customer_segment', kind='hist', kde=True, 
                 height=4, aspect=1.2)
g1.fig.suptitle('Order Value Distribution by Customer Segment\nFigure-Level Automatic Faceting', 
                y=1.02, fontsize=14, fontweight='bold')
plt.show()

# Figure-level: Multi-dimensional analysis
# Sample data for readability (fixed random_state keeps the plot reproducible)
sample_data = ecommerce_data.sample(2000, random_state=42)
top_categories = ecommerce_data['product_category'].value_counts().head(4).index
category_sample = sample_data[sample_data['product_category'].isin(top_categories)]

g2 = sns.relplot(data=category_sample, x='order_value', y='review_score',
                 col='product_category', hue='customer_segment',
                 kind='scatter', alpha=0.7, height=4, aspect=1)
g2.fig.suptitle('Order Value vs Satisfaction: Multi-Dimensional Analysis\nFaceted by Category, Colored by Segment', 
                y=1.02, fontsize=14, fontweight='bold')
plt.show()

print("\n✅ FIGURE-LEVEL ADVANTAGES:")
print("• Automatic subplot creation and arrangement")
print("• Built-in legend and color coordination")
print("• Perfect for data exploration and discovery")
print("• Minimal code for complex multi-dimensional plots")

In [None]:
# 4.2 Axes-level Functions for Custom Dashboards

print("\n🎯 AXES-LEVEL FUNCTIONS: Precision Dashboard Control")
print("=" * 60)

# Custom dashboard layout using axes-level functions
fig = plt.figure(figsize=(20, 12))
gs = fig.add_gridspec(3, 4, hspace=0.3, wspace=0.3)

# Title
fig.suptitle('E-commerce Executive Dashboard\nCustom Layout with Axes-Level Precision', 
             fontsize=18, fontweight='bold', y=0.96)

# KPI Summary (top row, spans 2 columns)
ax1 = fig.add_subplot(gs[0, :2])
kpi_data = {
    'Metric': ['Total Orders', 'Avg Order Value', 'Customer Satisfaction', 'Avg Delivery Time'],
    'Value': [f"{len(ecommerce_data):,}", 
              f"R$ {ecommerce_data['order_value'].mean():.2f}",
              f"{ecommerce_data['review_score'].mean():.2f}/5.0",
              f"{ecommerce_data['delivery_days'].mean():.1f} days"],
    'Status': ['📈', '💰', '😊', '🚚']
}
kpi_df = pd.DataFrame(kpi_data)
ax1.axis('tight')
ax1.axis('off')
table = ax1.table(cellText=kpi_df.values, colLabels=kpi_df.columns,
                  cellLoc='center', loc='center', colWidths=[0.4, 0.4, 0.2])
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 2)
ax1.set_title('Key Performance Indicators', fontweight='bold', pad=20)

# Revenue trend (top row, spans 2 columns)
ax2 = fig.add_subplot(gs[0, 2:])
monthly_revenue = ecommerce_data.groupby('order_month')['order_value'].sum()
monthly_revenue.plot(kind='line', ax=ax2, linewidth=3, color='#2E8B57')
ax2.set_title('Monthly Revenue Trend', fontweight='bold')
ax2.set_ylabel('Revenue (R$)')
ax2.tick_params(axis='x', rotation=45)

# Category performance (middle row, left)
ax3 = fig.add_subplot(gs[1, :2])
category_revenue = ecommerce_data.groupby('product_category')['order_value'].sum().sort_values(ascending=True)
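# Assign the category to hue (seaborn >= 0.13) so the palette applies without a deprecation warning
sns.barplot(x=category_revenue.values, y=category_revenue.index,
            hue=category_revenue.index, palette='viridis', legend=False, ax=ax3)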
ax3.set_title('Category Revenue Performance', fontweight='bold')
ax3.set_xlabel('Total Revenue (R$)')

# Customer satisfaction distribution (middle row, right)
ax4 = fig.add_subplot(gs[1, 2:])
satisfaction_counts = ecommerce_data['review_score'].value_counts().sort_index()
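# Assign the score to hue (seaborn >= 0.13) so the palette applies without a deprecation warning
sns.barplot(x=satisfaction_counts.index, y=satisfaction_counts.values,
            hue=satisfaction_counts.index, palette='RdYlBu_r', legend=False, ax=ax4)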
ax4.set_title('Customer Satisfaction Distribution', fontweight='bold')
ax4.set_xlabel('Review Score')
ax4.set_ylabel('Number of Orders')

# Geographic performance (bottom row, left)
ax5 = fig.add_subplot(gs[2, :2])
state_revenue = ecommerce_data.groupby('customer_state')['order_value'].sum().sort_values(ascending=False).head(8)
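# Assign the state to hue (seaborn >= 0.13) so the palette applies without a deprecation warning
sns.barplot(x=state_revenue.values, y=state_revenue.index,
            hue=state_revenue.index, palette='plasma', legend=False, ax=ax5)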
ax5.set_title('Top States by Revenue', fontweight='bold')
ax5.set_xlabel('Total Revenue (R$)')

# Delivery performance vs satisfaction (bottom row, right)
ax6 = fig.add_subplot(gs[2, 2:])
sns.boxplot(data=ecommerce_data, x='review_score', y='delivery_days', ax=ax6)
ax6.set_title('Delivery Performance Impact', fontweight='bold')
ax6.set_xlabel('Review Score')
ax6.set_ylabel('Delivery Days')

plt.show()

print("\n✅ AXES-LEVEL ADVANTAGES:")
print("• Complete control over layout and positioning")
print("• Mixed plot types in single dashboard")
print("• Custom spacing and sizing for each element")
print("• Perfect for executive reporting and presentations")
print("• Professional business dashboard aesthetics")

print("\n🎯 WHEN TO USE EACH APPROACH:")
print("\n📊 Figure-Level Functions:")
print("• Data exploration and hypothesis generation")
print("• Quick analysis with automatic faceting")
print("• Statistical relationship discovery")
print("• Academic and research contexts")

print("\n🏢 Axes-Level Functions:")
print("• Executive dashboards and business reports")
print("• Custom layouts with mixed visualizations")
print("• Presentation-ready professional graphics")
print("• Client deliverables and stakeholder communications")

## Summary and Transition to Part 2

### What You've Mastered in Seaborn Statistical Visualization

**🔍 Distribution Analysis:**
- Advanced histogram and KDE techniques for business metrics
- Box plots and violin plots for comparative analysis
- Statistical overlays (mean, median, confidence intervals)
- Log-scale transformations for financial data

**📈 Relationship Analysis:**
- Correlation matrices and heatmaps for metric relationships
- Scatter plots with regression lines and confidence intervals
- Multi-dimensional analysis with color and size encoding
- Statistical significance testing and interpretation

**🎯 Categorical Analysis:**
- Market share and performance analysis by category
- Customer segmentation visualization techniques
- Geographic and demographic pattern analysis
- Strategic positioning matrices and quadrant analysis

**🏗️ Seaborn Architecture:**
- Figure-level vs axes-level function selection
- Automatic faceting for exploration
- Custom dashboard creation for business reporting

### Ready for Part 2: Multi-Plot Figures and Complex Layouts

In the next session, you'll learn to:
- Create sophisticated multi-panel dashboards
- Master GridSpec for complex layouts
- Design executive-level presentation graphics
- Build coordinated visualization systems

**🚀 Your statistical visualization skills are now ready for professional business analytics!**