# Week 7: EDA Techniques - Part 2: Descriptive Statistics and Summary Insights

## Learning Objectives
By the end of this session, you will be able to:
- Apply comprehensive descriptive statistics to business datasets
- Generate meaningful summary insights for stakeholders
- Identify outliers and anomalies in data
- Create automated reporting functions for EDA

## Business Context
Building on our structured EDA framework, we now focus on **extracting quantitative insights** from our Olist e-commerce data. This session emphasizes translating statistical measures into actionable business intelligence.

**Key Business Questions:**
- What are the typical order values and customer behaviors?
- How do our product categories perform financially?
- What outliers or anomalies need attention?
- How can we summarize complex data for executive reporting?

## 1. Environment Setup and Data Loading

In [None]:
# Standard imports for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Database connection
from sqlalchemy import create_engine

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)
pd.set_option('display.float_format', '{:.2f}'.format)

# Enhanced plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ Environment setup complete for descriptive statistics analysis!")

In [None]:
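# Supabase connection
# Prefer a DATABASE_URL environment variable so credentials are not hard-coded
# in shared notebooks; the original connection string is kept as the fallback.
import os
DATABASE_URL = os.environ.get(
    "DATABASE_URL",
    "postgresql://postgres.pzykoxdiwsyclwfqfiii:L3tMeQuery123!@aws-0-us-east-1.pooler.supabase.com:6543/postgres"
)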
engine = create_engine(DATABASE_URL)

print("🔄 Loading comprehensive datasets for analysis...")

# Load main datasets with business focus
# ORDER BY makes each LIMIT sample deterministic and aligns both samples on
# order_id, so the later inner join on order_id should retain most rows.
orders_query = """
SELECT 
    o.*,
    EXTRACT(YEAR FROM order_purchase_timestamp) as order_year,
    EXTRACT(MONTH FROM order_purchase_timestamp) as order_month,
    EXTRACT(DOW FROM order_purchase_timestamp) as order_dow
FROM olist_sales_data_set.olist_orders_dataset o
WHERE order_status = 'delivered'
ORDER BY o.order_id
LIMIT 10000
"""

order_items_query = """
SELECT 
    oi.*,
    (oi.price + oi.freight_value) as total_item_value
FROM olist_sales_data_set.olist_order_items_dataset oi
ORDER BY oi.order_id, oi.order_item_id
LIMIT 15000
"""

# Load product data with category translations
products_query = """
SELECT 
    p.*,
    COALESCE(t.product_category_name_english, p.product_category_name) as category_english
FROM olist_sales_data_set.olist_products_dataset p
LEFT JOIN olist_sales_data_set.product_category_name_translation t
    ON p.product_category_name = t.product_category_name
"""

# Execute queries
orders_df = pd.read_sql(orders_query, engine)
order_items_df = pd.read_sql(order_items_query, engine)
products_df = pd.read_sql(products_query, engine)

print(f"✅ Data loaded:")
print(f"   📦 Orders: {len(orders_df):,} delivered orders")
print(f"   🛒 Order Items: {len(order_items_df):,} line items")
print(f"   📋 Products: {len(products_df):,} unique products")

## 2. Business Dataset Creation

Let's create a comprehensive business dataset by joining our tables for deeper analysis.
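
Before joining, it can help to check how many rows each merge keeps and whether any rows fail to match. The sketch below uses small made-up tables (not the Olist data) to show how the `validate` and `indicator` parameters of `DataFrame.merge` can surface unmatched rows and guard against accidental row duplication. The table contents and the `p9` product are illustrative assumptions.

```python
import pandas as pd

# Toy tables standing in for order items and products (hypothetical values)
items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "product_id": ["p1", "p2", "p1", "p9"],   # p9 has no product record
    "price": [10.0, 20.0, 10.0, 5.0],
})
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "category_english": ["toys", "books", "garden"],
})

# validate="many_to_one" raises if product_id is duplicated in `products`;
# indicator=True adds a `_merge` column showing where each row came from.
merged = items.merge(
    products, on="product_id", how="left",
    validate="many_to_one", indicator=True,
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Rows after merge: {len(merged)} (items before: {len(items)})")
print(f"Items without a product match: {unmatched}")
```

A left join keeps every line item, so the row count should not change; a growing row count usually points to duplicate keys on the right-hand table.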

In [None]:
# Create comprehensive business dataset
print("🔧 Creating comprehensive business dataset...")

# Merge order items with product information
business_data = order_items_df.merge(
    products_df[['product_id', 'category_english', 'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']], 
    on='product_id', 
    how='left'
)

# Merge with orders information
business_data = business_data.merge(
    orders_df[['order_id', 'order_year', 'order_month', 'order_dow', 'order_purchase_timestamp']], 
    on='order_id', 
    how='inner'
)

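# Calculate additional business metrics
# Note: the data contains no cost information, so 'profit_margin' is only a proxy:
# the share of the item price left after subtracting freight (equal to 1 - freight_ratio).
business_data['profit_margin'] = (business_data['price'] - business_data['freight_value']) / business_data['price']
business_data['freight_ratio'] = business_data['freight_value'] / business_data['price']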
business_data['product_volume'] = (
    business_data['product_length_cm'] * 
    business_data['product_height_cm'] * 
    business_data['product_width_cm']
) / 1000  # Convert to liters

# Clean category names
business_data['category_clean'] = business_data['category_english'].fillna('Unknown').str.title()

print(f"✅ Business dataset created with {len(business_data):,} records")
print(f"   📊 Columns: {business_data.shape[1]}")
print(f"   🏷️ Product categories: {business_data['category_clean'].nunique()}")

# Display sample
print("\n📋 Sample Business Data:")
display(business_data[['order_id', 'product_id', 'category_clean', 'price', 'freight_value', 'total_item_value', 'profit_margin']].head())
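
One property of the two ratio metrics is worth checking before interpreting them. Since `profit_margin` is `(price - freight_value) / price` and `freight_ratio` is `freight_value / price`, the two always sum to 1. The sketch below verifies this on made-up numbers (the `toy` frame is an assumption for illustration); it also means the two columns will show a correlation of exactly -1 with each other.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and freight values, for illustration only
toy = pd.DataFrame({
    "price": [50.0, 120.0, 19.9, 300.0],
    "freight_value": [10.0, 18.0, 8.5, 45.0],
})

toy["profit_margin"] = (toy["price"] - toy["freight_value"]) / toy["price"]
toy["freight_ratio"] = toy["freight_value"] / toy["price"]

# The two metrics are complements: margin + ratio == 1 for every row
total = toy["profit_margin"] + toy["freight_ratio"]
print("Sum is always 1:", np.allclose(total, 1.0))

# As a result, their Pearson correlation is -1 (a perfect linear relationship)
corr = toy["profit_margin"].corr(toy["freight_ratio"])
print("Correlation:", round(corr, 6))
```

Because one metric fully determines the other, they carry the same information and only one of them needs to be reported.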

## 3. Comprehensive Descriptive Statistics

Now let's dive deep into descriptive statistics to understand our business metrics.
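
To see why it is useful to report both the mean and the median, consider a small made-up price list with one expensive item. This is a minimal sketch on hypothetical values, not Olist data.

```python
import pandas as pd

# Five ordinary prices and one very expensive item (hypothetical values)
prices = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 900.0])

mean_price = prices.mean()      # pulled upward by the single 900.0 value
median_price = prices.median()  # middle of the sorted values, robust to it
iqr = prices.quantile(0.75) - prices.quantile(0.25)

print(f"Mean:   {mean_price:.2f}")
print(f"Median: {median_price:.2f}")
print(f"IQR:    {iqr:.2f}")
print(f"Skew:   {prices.skew():.2f}")
```

When the mean sits far above the median, as it does here, the distribution is right-skewed and the median is usually the better description of a "typical" value.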

In [None]:
# Financial Metrics Analysis
print("💰 Financial Metrics - Descriptive Statistics")
print("=" * 50)

# Key financial columns
financial_cols = ['price', 'freight_value', 'total_item_value', 'profit_margin', 'freight_ratio']

def enhanced_describe(df, columns, title):
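    """
    Return location, spread, quartile, and shape statistics for the given columns.

    Kurtosis is pandas' excess kurtosis (0 for a normal distribution).
    """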
    print(f"\n📊 {title}")
    print("-" * 40)
    
    stats_df = pd.DataFrame({
        'Mean': df[columns].mean(),
        'Median': df[columns].median(),
        'Std Dev': df[columns].std(),
        'Min': df[columns].min(),
        'Max': df[columns].max(),
        'Q1': df[columns].quantile(0.25),
        'Q3': df[columns].quantile(0.75),
        'IQR': df[columns].quantile(0.75) - df[columns].quantile(0.25),
        'Skewness': df[columns].skew(),
        'Kurtosis': df[columns].kurtosis()
    })
    
    return stats_df

# Financial statistics
financial_stats = enhanced_describe(business_data, financial_cols, "Financial Metrics Summary")
display(financial_stats.round(4))

# Business insights from financial data
# Rows in business_data are line items, so sum them per order before averaging
order_totals = business_data.groupby('order_id')['total_item_value'].sum()
avg_order_value = order_totals.mean()
median_order_value = order_totals.median()
avg_freight_ratio = business_data['freight_ratio'].mean() * 100
avg_profit_margin = business_data['profit_margin'].mean() * 100

print(f"\n💡 Key Financial Insights:")
print(f"   • Average order value: R$ {avg_order_value:.2f}")
print(f"   • Median order value: R$ {median_order_value:.2f}")
print(f"   • Average freight ratio: {avg_freight_ratio:.1f}% of item price")
print(f"   • Average profit margin: {avg_profit_margin:.1f}%")
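
Each row in `business_data` is a line item rather than an order, so a per-row mean is an average item value, not an average order value. The sketch below illustrates the difference on a made-up table (the `line_items` frame and its values are assumptions).

```python
import pandas as pd

# Hypothetical line items: order "a" has two items, order "b" has one
line_items = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "total_item_value": [100.0, 50.0, 30.0],
})

# Item-level average: mean over rows
toy_avg_item_value = line_items["total_item_value"].mean()

# Order-level average: sum items within each order first, then average
toy_order_totals = line_items.groupby("order_id")["total_item_value"].sum()
toy_avg_order_value = toy_order_totals.mean()

print(f"Average item value:  {toy_avg_item_value:.2f}")
print(f"Average order value: {toy_avg_order_value:.2f}")
```

The two numbers only coincide when every order contains exactly one item.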

In [None]:
# Product Category Performance Analysis
print("🏷️ Product Category Performance Analysis")
print("=" * 45)

# Category-level statistics
category_stats = business_data.groupby('category_clean').agg({
    'price': ['count', 'mean', 'median', 'std'],
    'total_item_value': ['mean', 'sum'],
    'freight_value': 'mean',
    'profit_margin': 'mean',
    'product_weight_g': 'mean'
}).round(2)

# Flatten column names
category_stats.columns = [f'{col[0]}_{col[1]}' if col[1] else col[0] for col in category_stats.columns]

# Sort categories by total revenue (highest first)
category_stats = category_stats.sort_values('total_item_value_sum', ascending=False)

print("\n🏆 Top 10 Categories by Total Revenue:")
top_categories = category_stats.head(10)
display(top_categories[['price_count', 'price_mean', 'total_item_value_sum', 'profit_margin_mean']])

# Category insights
most_popular_category = category_stats['price_count'].idxmax()
highest_revenue_category = category_stats['total_item_value_sum'].idxmax()
highest_margin_category = category_stats['profit_margin_mean'].idxmax()
highest_avg_price_category = category_stats['price_mean'].idxmax()

print(f"\n📈 Category Performance Insights:")
print(f"   • Most popular category: {most_popular_category} ({category_stats.loc[most_popular_category, 'price_count']:,} items sold)")
print(f"   • Highest revenue category: {highest_revenue_category} (R$ {category_stats.loc[highest_revenue_category, 'total_item_value_sum']:,.2f})")
print(f"   • Best profit margin: {highest_margin_category} ({category_stats.loc[highest_margin_category, 'profit_margin_mean']*100:.1f}%)")
print(f"   • Highest average price: {highest_avg_price_category} (R$ {category_stats.loc[highest_avg_price_category, 'price_mean']:.2f})")
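
As an alternative to flattening MultiIndex column names by hand, pandas supports named aggregation, where each output column is named directly in the `agg` call. The following is a minimal sketch on a made-up table; the output column names (`items_sold`, `price_mean`, `revenue`) are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical line items for two categories
toy = pd.DataFrame({
    "category_clean": ["Toys", "Toys", "Books", "Books", "Books"],
    "price": [10.0, 30.0, 5.0, 15.0, 10.0],
    "total_item_value": [12.0, 35.0, 7.0, 18.0, 12.0],
})

# Named aggregation: output_name=(source_column, aggregation)
summary = (
    toy.groupby("category_clean")
       .agg(
           items_sold=("price", "count"),
           price_mean=("price", "mean"),
           revenue=("total_item_value", "sum"),
       )
       .sort_values("revenue", ascending=False)
)
print(summary)
```

The result already has flat, descriptive column names, so no renaming step is needed afterwards.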

## 4. Distribution Analysis and Visualization

In [None]:
# Price Distribution Analysis
print("💸 Price Distribution Analysis")
print("=" * 35)

# Create comprehensive price distribution plots
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Price Distribution Analysis', fontsize=16, fontweight='bold')

# Histogram with KDE
axes[0, 0].hist(business_data['price'], bins=50, alpha=0.7, color='skyblue', density=True)
business_data['price'].plot(kind='kde', ax=axes[0, 0], color='red', linewidth=2)
axes[0, 0].set_title('Price Distribution (Histogram + KDE)')
axes[0, 0].set_xlabel('Price (R$)')
axes[0, 0].set_ylabel('Density')
axes[0, 0].grid(True, alpha=0.3)

# Box plot
axes[0, 1].boxplot(business_data['price'], vert=True)
axes[0, 1].set_title('Price Distribution (Box Plot)')
axes[0, 1].set_ylabel('Price (R$)')
axes[0, 1].grid(True, alpha=0.3)

# Log-scale histogram
log_prices = np.log1p(business_data['price'])
axes[1, 0].hist(log_prices, bins=50, alpha=0.7, color='lightgreen')
axes[1, 0].set_title('Log-Transformed Price Distribution')
axes[1, 0].set_xlabel('Log(Price + 1)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].grid(True, alpha=0.3)

# Q-Q plot for normality assessment
stats.probplot(business_data['price'], dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Price vs Normal Distribution)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical tests for normality
from scipy.stats import jarque_bera, shapiro

# Sample for Shapiro-Wilk test (max 5000 samples); fixed seed for reproducibility
price_sample = business_data['price'].sample(min(5000, len(business_data)), random_state=42)

jb_stat, jb_pvalue = jarque_bera(business_data['price'])
sw_stat, sw_pvalue = shapiro(price_sample)

print(f"\n📊 Normality Tests for Price Distribution:")
print(f"   • Jarque-Bera Test: statistic = {jb_stat:.2f}, p-value = {jb_pvalue:.2e}")
print(f"   • Shapiro-Wilk Test: statistic = {sw_stat:.4f}, p-value = {sw_pvalue:.2e}")
print(f"   • Interpretation: {'Not normally distributed' if jb_pvalue < 0.05 else 'Potentially normally distributed'} (α = 0.05)")

# Price percentiles
price_percentiles = business_data['price'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
print(f"\n💰 Price Percentiles:")
for percentile, value in price_percentiles.items():
    print(f"   • {percentile*100:2.0f}th percentile: R$ {value:.2f}")
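
The log-transformed histogram is useful because a log transform compresses the long right tail of price-like data. The sketch below illustrates the effect on synthetic values drawn from a log-normal distribution; the distribution and its parameters are assumptions chosen for illustration and are not fitted to the Olist prices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" (log-normal is an assumption, not a fitted model)
synthetic_prices = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

raw_skew = stats.skew(synthetic_prices)
log_skew = stats.skew(np.log1p(synthetic_prices))

print(f"Skewness before log1p: {raw_skew:.2f}")
print(f"Skewness after log1p:  {log_skew:.2f}")
```

A skewness close to zero after the transform suggests that methods assuming rough symmetry are better applied to the log-scale values than to the raw prices.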

In [None]:
# Outlier Detection and Analysis
print("🔍 Outlier Detection and Analysis")
print("=" * 40)

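def detect_outliers(data, method='iqr'):
    """
    Detect outliers in a numeric Series.

    method: 'iqr' (1.5 × IQR rule), 'zscore' (|z| > 3), or
    'modified_zscore' (median/MAD based, |M| > 3.5).
    """
    data = data.dropna()  # NaN values would turn the z-scores and the MAD into NaN

    if method == 'iqr':
        Q1 = data.quantile(0.25)
        Q3 = data.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        return data[(data < lower_bound) | (data > upper_bound)]
    
    elif method == 'zscore':
        z_scores = np.abs(stats.zscore(data))
        return data[z_scores > 3]
    
    elif method == 'modified_zscore':
        median = data.median()
        mad = np.median(np.abs(data - median))
        if mad == 0:
            # At least half the values equal the median, so the score is undefined
            return data.iloc[0:0]
        modified_z_scores = 0.6745 * (data - median) / mad
        return data[np.abs(modified_z_scores) > 3.5]

    raise ValueError(f"Unknown outlier detection method: {method!r}")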

# Detect outliers in price data
price_outliers_iqr = detect_outliers(business_data['price'], 'iqr')
price_outliers_zscore = detect_outliers(business_data['price'], 'zscore')
price_outliers_modified = detect_outliers(business_data['price'], 'modified_zscore')

print(f"\n🎯 Outlier Detection Results for Price:")
print(f"   • IQR Method: {len(price_outliers_iqr):,} outliers ({len(price_outliers_iqr)/len(business_data)*100:.2f}%)")
print(f"   • Z-Score Method: {len(price_outliers_zscore):,} outliers ({len(price_outliers_zscore)/len(business_data)*100:.2f}%)")
print(f"   • Modified Z-Score: {len(price_outliers_modified):,} outliers ({len(price_outliers_modified)/len(business_data)*100:.2f}%)")

# Analyze outlier characteristics
if len(price_outliers_iqr) > 0:
    outlier_indices = price_outliers_iqr.index  # index labels of the flagged rows
    outlier_categories = business_data.loc[outlier_indices, 'category_clean'].value_counts().head(5)
    
    print(f"\n📊 Top 5 Categories with High-Price Outliers:")
    for category, count in outlier_categories.items():
        print(f"   • {category}: {count} outliers")
    
    print(f"\n💰 Outlier Price Statistics:")
    print(f"   • Minimum outlier price: R$ {price_outliers_iqr.min():.2f}")
    print(f"   • Maximum outlier price: R$ {price_outliers_iqr.max():.2f}")
    print(f"   • Average outlier price: R$ {price_outliers_iqr.mean():.2f}")

# Visualize outliers
# Upper IQR fence (Q3 + 1.5 × IQR), reused by the scatter plot and histogram
price_q1 = business_data['price'].quantile(0.25)
price_q3 = business_data['price'].quantile(0.75)
price_upper_fence = price_q3 + 1.5 * (price_q3 - price_q1)

plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.boxplot(business_data['price'])
plt.title('Price Outliers\n(Box Plot)')
plt.ylabel('Price (R$)')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
plt.scatter(range(len(business_data)), business_data['price'], alpha=0.5, s=1)
plt.axhline(y=price_upper_fence, color='red', linestyle='--', label='Upper Outlier Threshold')
plt.title('Price Scatter Plot\nwith Outlier Threshold')
plt.xlabel('Data Point Index')
plt.ylabel('Price (R$)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
business_data['price'].plot(kind='hist', bins=50, alpha=0.7, color='lightcoral')
plt.axvline(x=price_upper_fence, color='red', linestyle='--', label='Outlier Threshold')
plt.title('Price Distribution\nwith Outlier Boundary')
plt.xlabel('Price (R$)')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
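
The three detection rules can disagree, especially when a single extreme value inflates the mean and standard deviation. The sketch below applies each rule to eight made-up values (an assumption for illustration) with the same thresholds used in this section.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Seven ordinary values and one extreme value (hypothetical data)
values = pd.Series([10, 11, 12, 12, 13, 14, 15, 100], dtype=float)

# IQR rule: flag values beyond 1.5 × IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_flags = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score rule: flag |z| > 3 (mean and std are both pulled by the extreme value)
z_scores = np.abs(stats.zscore(values))
z_flags = values[z_scores > 3]

# Modified z-score rule: median and MAD are robust to the extreme value
median = values.median()
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad
mod_flags = values[np.abs(modified_z) > 3.5]

print("IQR flags:             ", iqr_flags.tolist())
print("Z-score flags:         ", z_flags.tolist())
print("Modified z-score flags:", mod_flags.tolist())
```

Here the IQR and modified z-score rules both flag 100, while the plain z-score rule flags nothing: the extreme value raises the standard deviation enough that its own z-score stays below 3. This masking effect is one reason the robust rules are often preferred for skewed data.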

## 5. Correlation Analysis

In [None]:
# Correlation Analysis
print("🔗 Correlation Analysis")
print("=" * 25)

# Select numeric columns for correlation analysis
numeric_cols = ['price', 'freight_value', 'total_item_value', 'profit_margin', 
                'freight_ratio', 'product_weight_g', 'product_volume']

# Remove rows with missing values for correlation analysis
correlation_data = business_data[numeric_cols].dropna()

print(f"📊 Analyzing correlations for {len(correlation_data):,} complete records")

# Calculate correlation matrices
pearson_corr = correlation_data.corr(method='pearson')
spearman_corr = correlation_data.corr(method='spearman')

# Visualize correlation matrices
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Pearson correlation heatmap
mask = np.triu(np.ones_like(pearson_corr, dtype=bool))
sns.heatmap(pearson_corr, mask=mask, annot=True, cmap='RdBu_r', center=0,
            square=True, fmt='.3f', cbar_kws={"shrink": .8}, ax=axes[0])
axes[0].set_title('Pearson Correlation Matrix\n(Linear Relationships)', fontweight='bold')

# Spearman correlation heatmap
sns.heatmap(spearman_corr, mask=mask, annot=True, cmap='RdBu_r', center=0,
            square=True, fmt='.3f', cbar_kws={"shrink": .8}, ax=axes[1])
axes[1].set_title('Spearman Correlation Matrix\n(Monotonic Relationships)', fontweight='bold')

plt.tight_layout()
plt.show()

# Identify strong correlations
def find_strong_correlations(corr_matrix, threshold=0.5):
    """
    Find pairs of variables with strong correlations
    """
    strong_corrs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            corr_value = corr_matrix.iloc[i, j]
            if abs(corr_value) >= threshold:
                strong_corrs.append({
                    'Variable 1': corr_matrix.columns[i],
                    'Variable 2': corr_matrix.columns[j],
                    'Correlation': corr_value
                })
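    result = pd.DataFrame(strong_corrs, columns=['Variable 1', 'Variable 2', 'Correlation'])
    if result.empty:
        # No pair met the threshold; return the empty frame without sorting
        return result
    return result.sort_values('Correlation', key=abs, ascending=False)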

strong_pearson = find_strong_correlations(pearson_corr, 0.5)
strong_spearman = find_strong_correlations(spearman_corr, 0.5)

print(f"\n🔍 Strong Pearson Correlations (|r| ≥ 0.5):")
if not strong_pearson.empty:
    display(strong_pearson)
else:
    print("   No strong linear correlations found.")

print(f"\n🔍 Strong Spearman Correlations (|ρ| ≥ 0.5):")
if not strong_spearman.empty:
    display(strong_spearman)
else:
    print("   No strong monotonic correlations found.")

# Business interpretation of correlations
print(f"\n💡 Business Insights from Correlations:")
price_freight_corr = pearson_corr.loc['price', 'freight_value']
price_volume_corr = pearson_corr.loc['price', 'product_volume'] if 'product_volume' in pearson_corr.columns else 0

print(f"   • Price-Freight correlation: {price_freight_corr:.3f}")
if abs(price_freight_corr) > 0.3:
    print(f"     → Moderate relationship between item price and shipping cost")
else:
    print(f"     → Weak relationship between item price and shipping cost")

if 'product_volume' in correlation_data.columns:
    print(f"   • Price-Volume correlation: {price_volume_corr:.3f}")
    if abs(price_volume_corr) > 0.3:
        print(f"     → Product size is associated with price (correlation, not causation)")
    else:
        print(f"     → Product size shows little linear association with price")
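
The two heatmaps can differ because Pearson measures linear association while Spearman measures monotonic association on ranks. A minimal sketch on made-up data, where `y` always increases with `x` but not along a straight line:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 11), dtype=float)
y = np.exp(x)  # strictly increasing in x, but strongly nonlinear

pearson_r = x.corr(y, method="pearson")
spearman_rho = x.corr(y, method="spearman")

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman is 1 because the ranks of `x` and `y` agree perfectly, while Pearson is noticeably lower. A large gap between the two coefficients for a pair of business metrics is a hint that the relationship is monotonic but curved, or that a few extreme values dominate the linear fit.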

## 6. Executive Summary Dashboard

In [None]:
# Executive Summary Dashboard
print("📈 EXECUTIVE SUMMARY - OLIST E-COMMERCE ANALYSIS")
print("=" * 60)

# Key Performance Indicators
total_revenue = business_data['total_item_value'].sum()
total_orders = business_data['order_id'].nunique()
total_products_sold = len(business_data)
avg_order_value = business_data.groupby('order_id')['total_item_value'].sum().mean()
unique_products = business_data['product_id'].nunique()
unique_categories = business_data['category_clean'].nunique()

print(f"\n📊 KEY PERFORMANCE INDICATORS:")
print(f"   💰 Total Revenue: R$ {total_revenue:,.2f}")
print(f"   📦 Total Orders: {total_orders:,}")
print(f"   🛒 Total Products Sold: {total_products_sold:,}")
print(f"   💵 Average Order Value: R$ {avg_order_value:.2f}")
print(f"   📋 Unique Products: {unique_products:,}")
print(f"   🏷️ Product Categories: {unique_categories}")

# Top performers
print(f"\n🏆 TOP PERFORMERS:")
top_category_by_revenue = business_data.groupby('category_clean')['total_item_value'].sum().idxmax()
top_category_revenue = business_data.groupby('category_clean')['total_item_value'].sum().max()
print(f"   🥇 Top Category by Revenue: {top_category_by_revenue} (R$ {top_category_revenue:,.2f})")

most_expensive_category = business_data.groupby('category_clean')['price'].mean().idxmax()
highest_avg_price = business_data.groupby('category_clean')['price'].mean().max()
print(f"   💎 Highest Average Price Category: {most_expensive_category} (R$ {highest_avg_price:.2f})")

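# Risk indicators
print(f"\n⚠️ RISK INDICATORS:")
high_freight_orders = (business_data['freight_ratio'] > 0.3).sum()  # counted per line item
high_freight_pct = (high_freight_orders / len(business_data)) * 100
print(f"   🚛 High Freight Ratio Items (freight > 30% of price): {high_freight_orders:,} ({high_freight_pct:.1f}%)")
price_skew = business_data['price'].skew()
skew_label = 'highly right-skewed' if price_skew > 1 else 'highly left-skewed' if price_skew < -1 else 'not highly skewed'
print(f"   📊 Price Distribution Skewness: {price_skew:.2f} ({skew_label})")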

# Recommendations
print(f"\n🎯 STRATEGIC RECOMMENDATIONS:")
print(f"   1. Focus marketing efforts on top-performing category: {top_category_by_revenue}")
print(f"   2. Investigate the {high_freight_pct:.1f}% of line items where freight exceeds 30% of the item price")
print(f"   3. Consider premium pricing strategy for {most_expensive_category} category")
print(f"   4. Implement outlier detection system for price anomalies")
print(f"   5. Optimize product mix based on profit margin analysis")

# Create summary visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Executive Summary Dashboard', fontsize=16, fontweight='bold')

# Revenue by category (top 10)
category_revenue = business_data.groupby('category_clean')['total_item_value'].sum().sort_values(ascending=False).head(10)
category_revenue.plot(kind='bar', ax=axes[0, 0], color='steelblue')
axes[0, 0].set_title('Top 10 Categories by Revenue')
axes[0, 0].set_xlabel('Category')
axes[0, 0].set_ylabel('Revenue (R$)')
axes[0, 0].tick_params(axis='x', rotation=45)

# Order value distribution
order_values = business_data.groupby('order_id')['total_item_value'].sum()
axes[0, 1].hist(order_values, bins=50, alpha=0.7, color='lightcoral')
axes[0, 1].axvline(order_values.mean(), color='red', linestyle='--', label=f'Mean: R${order_values.mean():.2f}')
axes[0, 1].set_title('Order Value Distribution')
axes[0, 1].set_xlabel('Order Value (R$)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()

# Freight ratio distribution
axes[1, 0].hist(business_data['freight_ratio'], bins=50, alpha=0.7, color='lightgreen')
axes[1, 0].axvline(business_data['freight_ratio'].mean(), color='red', linestyle='--', 
                  label=f'Mean: {business_data["freight_ratio"].mean()*100:.1f}%')
axes[1, 0].set_title('Freight Ratio Distribution')
axes[1, 0].set_xlabel('Freight as % of Price')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

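# Monthly revenue trend
# Group by calendar year-month so the same month in different years is not combined
monthly_revenue = business_data.groupby(
    pd.to_datetime(business_data['order_purchase_timestamp']).dt.to_period('M')
)['total_item_value'].sum()
monthly_revenue.plot(kind='line', marker='o', ax=axes[1, 1], color='purple', linewidth=2)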
axes[1, 1].set_title('Monthly Revenue Trend')
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Revenue (R$)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
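
When the data spans more than one year, a monthly trend depends on how the months are grouped. The sketch below uses a made-up table (dates and values are assumptions) to compare grouping by month number with grouping by year-month period.

```python
import pandas as pd

# Hypothetical sales in the same two months of two different years
sales = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-01-15", "2017-02-10", "2018-01-20", "2018-02-05"]
    ),
    "total_item_value": [100.0, 200.0, 300.0, 400.0],
})

# Month number only: January 2017 and January 2018 are added together
by_month_number = sales.groupby(
    sales["order_purchase_timestamp"].dt.month
)["total_item_value"].sum()

# Year-month period: every calendar month stays separate
by_period = sales.groupby(
    sales["order_purchase_timestamp"].dt.to_period("M")
)["total_item_value"].sum()

print(by_month_number)
print(by_period)
```

Grouping by month number answers a seasonality question ("which months are strongest on average?"), while grouping by period gives a true time trend. It helps to label the chart according to which of the two is shown.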

## 7. Automated EDA Report Function

In [None]:
def generate_eda_report(dataframe, target_column=None, categorical_threshold=10):
    """
    Generate comprehensive EDA report for any dataset
    
    Parameters:
    -----------
    dataframe : pd.DataFrame
        The dataset to analyze
    target_column : str, optional
        Reserved for target-variable analysis; currently unused by the report
    categorical_threshold : int
        Maximum unique values to consider a column categorical
    """
    
    print(f"🔍 AUTOMATED EDA REPORT")
    print(f"{'='*50}")
    
    # Dataset overview
    print(f"\n📊 DATASET OVERVIEW:")
    print(f"   Shape: {dataframe.shape[0]:,} rows × {dataframe.shape[1]} columns")
    print(f"   Memory usage: {dataframe.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    # Data types summary
    print(f"\n📋 DATA TYPES:")
    dtype_counts = dataframe.dtypes.value_counts()
    for dtype, count in dtype_counts.items():
        print(f"   {dtype}: {count} columns")
    
    # Missing values analysis
    print(f"\n❓ MISSING VALUES:")
    missing_summary = dataframe.isnull().sum()
    missing_pct = (missing_summary / len(dataframe)) * 100
    
    if missing_summary.sum() > 0:
        missing_df = pd.DataFrame({
            'Missing Count': missing_summary[missing_summary > 0],
            'Missing %': missing_pct[missing_summary > 0]
        }).sort_values('Missing %', ascending=False)
        display(missing_df)
    else:
        print("   ✅ No missing values found!")
    
    # Identify column types
    numeric_cols = dataframe.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = []
    
    for col in dataframe.columns:
        if col not in numeric_cols:
            if dataframe[col].nunique() <= categorical_threshold:
                categorical_cols.append(col)
    
    # Numeric variables analysis
    if numeric_cols:
        print(f"\n📈 NUMERIC VARIABLES SUMMARY:")
        numeric_summary = dataframe[numeric_cols].describe()
        display(numeric_summary)
        
        # Skewness analysis
        skewness = dataframe[numeric_cols].skew()
        print(f"\n📊 SKEWNESS ANALYSIS:")
        for col, skew_val in skewness.items():
            if abs(skew_val) > 1:
                skew_type = "highly skewed"
            elif abs(skew_val) > 0.5:
                skew_type = "moderately skewed"
            else:
                skew_type = "approximately symmetric"
            print(f"   {col}: {skew_val:.2f} ({skew_type})")
    
    # Categorical variables analysis
    if categorical_cols:
        print(f"\n🏷️ CATEGORICAL VARIABLES SUMMARY:")
        for col in categorical_cols:
            unique_count = dataframe[col].nunique()
            most_common = dataframe[col].mode().iloc[0] if len(dataframe[col].mode()) > 0 else 'N/A'
            print(f"   {col}: {unique_count} unique values, most common: '{most_common}'")
    
    # Correlation analysis for numeric variables
    if len(numeric_cols) > 1:
        print(f"\n🔗 CORRELATION ANALYSIS:")
        corr_matrix = dataframe[numeric_cols].corr()
        
        # Find strong correlations
        strong_corrs = []
        for i in range(len(corr_matrix.columns)):
            for j in range(i+1, len(corr_matrix.columns)):
                corr_value = corr_matrix.iloc[i, j]
                if abs(corr_value) >= 0.5:
                    strong_corrs.append({
                        'Variable 1': corr_matrix.columns[i],
                        'Variable 2': corr_matrix.columns[j],
                        'Correlation': corr_value
                    })
        
        if strong_corrs:
            strong_corr_df = pd.DataFrame(strong_corrs).sort_values('Correlation', key=abs, ascending=False)
            display(strong_corr_df)
        else:
            print("   No strong correlations (|r| ≥ 0.5) found.")
    
    # Outlier analysis for numeric variables
    if numeric_cols:
        print(f"\n🎯 OUTLIER ANALYSIS (IQR Method):")
        for col in numeric_cols:
            Q1 = dataframe[col].quantile(0.25)
            Q3 = dataframe[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers = dataframe[(dataframe[col] < lower_bound) | (dataframe[col] > upper_bound)]
            outlier_pct = (len(outliers) / len(dataframe)) * 100
            print(f"   {col}: {len(outliers):,} outliers ({outlier_pct:.2f}%)")
    
    print(f"\n✅ EDA Report Complete!")
    return {
        'numeric_columns': numeric_cols,
        'categorical_columns': categorical_cols,
        'missing_summary': missing_summary,
        'correlation_matrix': corr_matrix if len(numeric_cols) > 1 else None
    }

# Test the automated EDA function
print("🧪 Testing Automated EDA Function on Sample Data")
print("=" * 50)

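# Create sample dataset (capped at the available row count; fixed seed for reproducibility)
sample_data = business_data[['price', 'freight_value', 'total_item_value', 
                           'category_clean', 'profit_margin', 'order_year']].sample(
    min(1000, len(business_data)), random_state=42)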

eda_results = generate_eda_report(sample_data)
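
The column-typing step of the report decides which columns are summarized at all. The sketch below reproduces that logic on a made-up frame (the `toy` data and the threshold of 3 are assumptions for illustration) to show how `categorical_threshold` treats a high-cardinality identifier column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with numeric, low-cardinality, and ID-like columns
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "order_year": [2017, 2017, 2018, 2018],
    "category_clean": ["Toys", "Books", "Toys", "Books"],
    "order_id": ["a", "b", "c", "d"],
})
categorical_threshold = 3

# Numeric columns are picked by dtype
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()

# Non-numeric columns count as categorical only if they have few unique values
categorical_cols = [
    col for col in toy.columns
    if col not in numeric_cols and toy[col].nunique() <= categorical_threshold
]

# Everything else (for example ID columns) is left out of both summaries
ignored_cols = [
    col for col in toy.columns
    if col not in numeric_cols and col not in categorical_cols
]

print("Numeric:    ", numeric_cols)
print("Categorical:", categorical_cols)
print("Ignored:    ", ignored_cols)
```

Note that `order_year` is typed as numeric here, so it receives mean and skewness statistics even though it behaves more like a category; casting such columns to `str` or `category` before calling the report is one way to change that.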

## Summary and Conclusions

### What We've Accomplished

1. **✅ Comprehensive Descriptive Statistics**: Applied advanced statistical measures to understand our business data
2. **✅ Business-Focused Analysis**: Translated statistical insights into actionable business intelligence
3. **✅ Outlier Detection**: Identified anomalies that require business attention
4. **✅ Correlation Analysis**: Discovered relationships between key business metrics
5. **✅ Executive Dashboard**: Created summary visualizations for stakeholder reporting
6. **✅ Automated EDA Function**: Built reusable tools for future analysis

### Key Business Insights

**Financial Performance:**
- Clear understanding of order value distributions and pricing patterns
- Identification of high- and low-performing product categories
- Freight cost analysis revealing potential optimization opportunities

**Risk Management:**
- Systematic outlier detection for price anomalies
- Statistical validation of data quality
- Identification of categories requiring attention

**Strategic Recommendations:**
- Data-driven category performance insights
- Pricing strategy recommendations based on statistical analysis
- Operational efficiency opportunities in freight management

### Next Steps
In Part 3, we'll explore:
- Advanced distribution analysis techniques
- Deep correlation exploration with visualization
- Statistical hypothesis testing for business questions

## 🎯 Practice Exercises - Part 2

Apply your descriptive statistics knowledge:

1. **Category Deep Dive**: Choose a product category and perform comprehensive descriptive analysis

2. **Profit Margin Analysis**: Calculate and analyze profit margins by different business dimensions

3. **Custom EDA Function**: Enhance the automated EDA function with additional statistical tests

4. **Business Metric Creation**: Define and calculate new business KPIs from the available data

In [None]:
# Exercise space for Part 2 - Descriptive Statistics

# Exercise 1: Category Deep Dive
# Choose a category and analyze its complete statistical profile

# Exercise 2: Profit Margin Analysis
# Calculate profit margins across different dimensions

# Exercise 3: Enhanced EDA Function
# Add statistical tests to the automated EDA function

# Exercise 4: Business KPI Creation
# Define new metrics relevant to e-commerce business