# Session 3: Exploratory Data Analysis (EDA) and Pattern Discovery

**Module:** Data Insights and Visualization  
**Level:** 7 | **Credits:** 10  
**Learning Outcomes Addressed:** LO1, LO3  
**Big Academy Saudi Arabia - Riyadh Campus** 🇸🇦

---

![Big Academy](https://img.shields.io/badge/Big%20Academy-Saudi%20Arabia-green?style=for-the-badge)
![Level](https://img.shields.io/badge/Level-7%20Master's-blue?style=for-the-badge)
![Session](https://img.shields.io/badge/Session-3%20of%208-orange?style=for-the-badge)

<!-- CELL BREAK -->

## 📋 Session Overview

Welcome to Session 3! Now that we have clean, prepared data from Session 2, we can begin the exciting journey of **Exploratory Data Analysis (EDA)**. This is where we discover hidden patterns, relationships, and insights that will drive business decisions. EDA is both an art and a science: it combines statistical rigor with creative investigation.

### 🎯 Learning Objectives
#### By the end of this session, you will be able to:

- **Calculate and interpret descriptive statistics for business insights**
- **Discover relationships and correlations between variables**
- **Analyze data distributions and identify patterns**
- **Create professional visualizations for pattern discovery**
- **Perform time series analysis and trend identification**
- **Conduct comparative analysis and benchmarking**
- **Apply statistical tests to validate findings**
- **Generate executive-ready insights and recommendations**

<!-- CELL BREAK -->

### 📚 Learning Outcomes Alignment
- **LO1:** Apply statistical and programming techniques to analyse complex structured and unstructured datasets
- **LO3:** Critically interpret data patterns and trends for effective communication and decision-making

**📍 Session 3 Focus: Pattern Discovery & Statistical Analysis**

### ⏱️ Session Structure (3 Hours)
- **Part 1:** Descriptive Statistics & Data Summarization (45 minutes)
- **Part 2:** Correlation Analysis & Relationships (45 minutes)  
- **Part 3:** Distribution Analysis & Statistical Testing (45 minutes)
- **Part 4:** Advanced Visualization for Pattern Discovery (30 minutes)
- **Part 5:** Time Series Analysis & Trends (30 minutes)
- **Part 6:** Comparative Analysis & Business Insights (15 minutes)

<!-- CELL BREAK -->

---
## 📊 Part 1: Descriptive Statistics and Data Summarization

<!-- CELL BREAK -->

### 1.1 Understanding Descriptive Statistics

**What are Descriptive Statistics?**
Descriptive statistics summarize and describe the main features of a dataset. They provide simple summaries about the data and help us understand the basic characteristics of our variables.

**Why are they Important?**
- Provide quick insights into data characteristics
- Help identify data quality issues
- Form the foundation for deeper analysis
- Essential for business reporting and communication

#### 📋 Types of Descriptive Statistics

| **Category** | **Measures** | **Purpose** | **When to Use** |
|--------------|--------------|-------------|-----------------|
| **Central Tendency** | Mean, Median, Mode | Find the "center" of data | Compare average performance, identify typical values |
| **Dispersion** | Range, Variance, Standard Deviation | Measure data spread | Assess consistency, identify variability |
| **Shape** | Skewness, Kurtosis | Describe distribution shape | Understand data distribution characteristics |
| **Position** | Quartiles, Percentiles | Find data positions | Identify outliers, create benchmarks |

<!-- CELL BREAK -->

```python
# Session 3 Environment Setup
print("📊 SESSION 3: EXPLORATORY DATA ANALYSIS (EDA)")
print("Big Academy Saudi Arabia - Level 7 Master's Program")
print("="*70)

# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set visualization parameters
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Session information
session_info = {
    "Session": "3 - Exploratory Data Analysis",
    "Duration": "3 hours",
    "Focus": "LO1 & LO3", 
    "Key Skills": "Statistics, Visualization, Pattern Discovery",
    "Tools": "Pandas, Matplotlib, Seaborn, SciPy"
}

print("\n📋 SESSION INFORMATION:")
for key, value in session_info.items():
    print(f"  • {key}: {value}")

print("\n✅ Environment ready for EDA!")
```

<!-- CELL BREAK -->

### 1.2 Creating Sample Dataset for Analysis

Let's create a comprehensive business dataset to practice our EDA techniques.

<!-- CELL BREAK -->

```python
# Create comprehensive business dataset for EDA practice
print("🏗️ CREATING COMPREHENSIVE BUSINESS DATASET FOR EDA:\n")

# Set random seed for reproducibility
np.random.seed(42)

# Generate realistic e-commerce business data
n_customers = 2000
n_products = 50
n_transactions = 8000

# Customer data
customer_ages = np.random.normal(35, 12, n_customers)
customer_ages = np.clip(customer_ages, 18, 70).round().astype(int)

customer_cities = np.random.choice(['Riyadh', 'Jeddah', 'Dammam', 'Mecca', 'Medina', 'Tabuk'], 
                                  n_customers, p=[0.35, 0.25, 0.15, 0.1, 0.1, 0.05])

customer_segments = np.random.choice(['Premium', 'Standard', 'Basic'], 
                                   n_customers, p=[0.2, 0.5, 0.3])

customers_df = pd.DataFrame({
    'customer_id': range(1001, 1001 + n_customers),
    'age': customer_ages,
    'city': customer_cities,
    'segment': customer_segments,
    'registration_date': pd.date_range('2022-01-01', periods=n_customers, freq='6h')  # lowercase 'h': uppercase 'H' is deprecated in pandas 2.2+
})

# Product data
product_categories = ['Electronics', 'Fashion', 'Home', 'Sports', 'Books']

# Gamma (shape, scale) parameters for each category's price distribution
price_params = {
    'Electronics': (2, 300),
    'Fashion': (1.5, 150),
    'Home': (2, 200),
    'Sports': (1.8, 120),
    'Books': (1, 40)
}

# Assign each product one category, then draw its price from that category's distribution
# (one draw per product, so price matches the product's own category)
assigned_categories = np.random.choice(product_categories, n_products)

products_df = pd.DataFrame({
    'product_id': range(2001, 2001 + n_products),
    'category': assigned_categories,
    'price': [round(np.random.gamma(*price_params[cat]), 2) for cat in assigned_categories]
})

# Transaction data
transaction_dates = pd.date_range('2023-01-01', '2024-12-31', freq='h')  # lowercase 'h': uppercase 'H' is deprecated in pandas 2.2+
selected_dates = np.random.choice(transaction_dates, n_transactions)

# Create transactions with realistic patterns
transactions_df = pd.DataFrame({
    'transaction_id': range(5001, 5001 + n_transactions),
    'customer_id': np.random.choice(customers_df['customer_id'], n_transactions),
    'product_id': np.random.choice(products_df['product_id'], n_transactions),
    'transaction_date': selected_dates,
    'quantity': np.random.poisson(2, n_transactions) + 1,  # At least 1
    'discount_percent': np.random.choice([0, 5, 10, 15, 20], n_transactions, p=[0.4, 0.25, 0.2, 0.1, 0.05])
})

# Merge to create comprehensive dataset
ecommerce_data = transactions_df.merge(customers_df, on='customer_id', how='left')
ecommerce_data = ecommerce_data.merge(products_df, on='product_id', how='left')

# Calculate derived fields
ecommerce_data['subtotal'] = ecommerce_data['price'] * ecommerce_data['quantity']
ecommerce_data['discount_amount'] = ecommerce_data['subtotal'] * (ecommerce_data['discount_percent'] / 100)
ecommerce_data['total_amount'] = ecommerce_data['subtotal'] - ecommerce_data['discount_amount']

# Add time-based features
ecommerce_data['month'] = ecommerce_data['transaction_date'].dt.month
ecommerce_data['weekday'] = ecommerce_data['transaction_date'].dt.weekday
ecommerce_data['hour'] = ecommerce_data['transaction_date'].dt.hour

print(f"✅ Created comprehensive e-commerce dataset: {ecommerce_data.shape}")
print(f"📊 Dataset covers: {len(customers_df)} customers, {len(products_df)} products, {len(transactions_df)} transactions")

# Display sample data
print("\n📋 Sample E-commerce Data:")
print(ecommerce_data[['customer_id', 'age', 'city', 'category', 'price', 'quantity', 'total_amount', 'transaction_date']].head(10))
```

<!-- CELL BREAK -->

### 1.3 Central Tendency Measures

**What is Central Tendency?**
Central tendency describes the center or typical value of a dataset. It helps us understand what a "normal" or "average" value looks like.

#### 📊 Central Tendency Methods Comparison

| **Method** | **Description** | **When to Use** | **Advantages** | **Disadvantages** |
|------------|-----------------|-----------------|----------------|-------------------|
| **Mean** | Average of all values | Normal distributions, no extreme outliers | Easy to calculate, uses all data | Affected by outliers |
| **Median** | Middle value when sorted | Skewed distributions, with outliers | Not affected by outliers | Ignores extreme values |
| **Mode** | Most frequently occurring value | Categorical data, discrete values | Shows most common value | May not exist or be multiple |
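
The "affected by outliers" trade-off in the table can be seen in a few lines. The following is a minimal sketch using made-up order values (the numbers and variable names are illustrative assumptions, not part of the course dataset): one extreme order moves the mean a long way, while the median barely changes.

<!-- CELL BREAK -->

```python
import numpy as np

# Hypothetical order values (illustrative numbers, not from the course dataset)
orders = np.array([40.0, 45.0, 50.0, 55.0, 60.0])
orders_with_outlier = np.append(orders, 5000.0)  # add one unusually large order

mean_before, median_before = orders.mean(), np.median(orders)
mean_after, median_after = orders_with_outlier.mean(), np.median(orders_with_outlier)

print(f"Without outlier: mean = {mean_before:.2f}, median = {median_before:.2f}")
print(f"With outlier:    mean = {mean_after:.2f}, median = {median_after:.2f}")
```

<!-- CELL BREAK -->

When the mean and median of a business metric differ this sharply, reporting the median (or both) usually gives a fairer picture of the "typical" value.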

<!-- CELL BREAK -->

```python
# Central Tendency Analysis
print("📊 CENTRAL TENDENCY ANALYSIS:")
print("="*50)

# Select numerical columns for analysis
numerical_columns = ['age', 'price', 'quantity', 'total_amount']

print("\n1️⃣ MEAN (AVERAGE) ANALYSIS:")
print("   Purpose: Find the average value")
print("   Best for: Normal distributions, no extreme outliers")

for col in numerical_columns:
    mean_value = ecommerce_data[col].mean()
    print(f"   • {col}: {mean_value:.2f}")

print("\n2️⃣ MEDIAN (MIDDLE VALUE) ANALYSIS:")
print("   Purpose: Find the middle value when data is sorted")  
print("   Best for: Skewed distributions, data with outliers")

for col in numerical_columns:
    median_value = ecommerce_data[col].median()
    print(f"   • {col}: {median_value:.2f}")

print("\n3️⃣ MODE (MOST FREQUENT) ANALYSIS:")
print("   Purpose: Find the most commonly occurring value")
print("   Best for: Categorical data, discrete values")

# Mode for categorical columns
categorical_columns = ['city', 'segment', 'category']
for col in categorical_columns:
    mode_value = ecommerce_data[col].mode().iloc[0] if len(ecommerce_data[col].mode()) > 0 else 'No mode'
    print(f"   • {col}: {mode_value}")

# Mode for numerical columns (rounded for meaningful results)
for col in numerical_columns:
    mode_value = ecommerce_data[col].round().mode()
    if len(mode_value) > 0:
        print(f"   • {col}: {mode_value.iloc[0]}")
```

<!-- CELL BREAK -->

### 1.4 Dispersion Measures

**What is Dispersion?**
Dispersion measures describe how spread out or scattered the data values are. They help us understand the variability and consistency in our data.

#### 📊 Dispersion Methods Comparison

| **Method** | **Description** | **When to Use** | **Interpretation** |
|------------|-----------------|-----------------|-------------------|
| **Range** | Difference between max and min | Quick variability check | Larger range = more spread |
| **Variance** | Average squared deviation from mean | Statistical calculations | Higher variance = more variability |
| **Standard Deviation** | Square root of variance | Compare variability between datasets | Same units as original data |
| **Coefficient of Variation** | Standard deviation / mean | Compare relative variability | Higher CV = more relative variability |
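
Two practical details sit behind this table: pandas computes the *sample* variance (dividing by n - 1) by default, whereas NumPy's `np.var` computes the *population* variance (dividing by n), and the coefficient of variation lets you compare spread across very different scales. The sketch below uses two hypothetical stores with invented sales figures to illustrate both points.

<!-- CELL BREAK -->

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales for two stores (illustrative values only)
store_a = pd.Series([100.0, 110.0, 90.0, 105.0, 95.0])        # small store
store_b = pd.Series([1000.0, 1100.0, 900.0, 1050.0, 950.0])   # store with 10x the volume

# pandas defaults to the sample formula (ddof=1); NumPy defaults to the population formula (ddof=0)
sample_var = store_a.var()
population_var = np.var(store_a.to_numpy())
print(f"Sample variance (pandas):    {sample_var:.2f}")
print(f"Population variance (NumPy): {population_var:.2f}")

# Coefficient of variation: std / mean, a unit-free measure of relative spread
cv_a = store_a.std() / store_a.mean()
cv_b = store_b.std() / store_b.mean()
print(f"Std dev: A = {store_a.std():.2f}, B = {store_b.std():.2f}")
print(f"CV:      A = {cv_a:.3f}, B = {cv_b:.3f}")
```

<!-- CELL BREAK -->

Store B has a standard deviation ten times larger, yet both stores have the same CV, so their sales are equally consistent relative to their size.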

<!-- CELL BREAK -->

```python
# Dispersion Analysis
print("\n📈 DISPERSION (VARIABILITY) ANALYSIS:")
print("="*50)

print("\n1️⃣ RANGE ANALYSIS:")
print("   Purpose: Shows the spread between minimum and maximum values")
print("   Formula: Maximum - Minimum")

for col in numerical_columns:
    min_val = ecommerce_data[col].min()
    max_val = ecommerce_data[col].max()
    range_val = max_val - min_val
    print(f"   • {col}: {range_val:.2f} (Min: {min_val:.2f}, Max: {max_val:.2f})")

print("\n2️⃣ VARIANCE ANALYSIS:")
print("   Purpose: Measures average squared deviation from the mean")
print("   Interpretation: Higher variance = more scattered data")

for col in numerical_columns:
    variance = ecommerce_data[col].var()
    print(f"   • {col}: {variance:.2f}")

print("\n3️⃣ STANDARD DEVIATION ANALYSIS:")
print("   Purpose: Shows typical deviation from the mean (same units as data)")
print("   Interpretation: For roughly normal data, ~68% of values fall within 1 std dev of the mean")

for col in numerical_columns:
    std_dev = ecommerce_data[col].std()
    mean_val = ecommerce_data[col].mean()
    print(f"   • {col}: {std_dev:.2f} (Mean ± Std: {mean_val-std_dev:.2f} to {mean_val+std_dev:.2f})")

print("\n4️⃣ COEFFICIENT OF VARIATION ANALYSIS:")
print("   Purpose: Relative variability (std dev / mean)")
print("   Interpretation: Higher CV = more relative variability")

for col in numerical_columns:
    cv = (ecommerce_data[col].std() / ecommerce_data[col].mean()) * 100
    print(f"   • {col}: {cv:.1f}%")
```

<!-- CELL BREAK -->

### 1.5 Distribution Shape Measures

**What is Distribution Shape?**
Distribution shape describes how the data values are distributed across the range. Understanding shape helps us choose appropriate analysis methods.

#### 📊 Shape Measures Comparison

| **Measure** | **What it Shows** | **Values** | **Interpretation** |
|-------------|-------------------|------------|-------------------|
| **Skewness** | Asymmetry of distribution | < 0: Left-skewed, 0: Symmetric, > 0: Right-skewed | Shows direction of tail |
| **Kurtosis** | Tail heaviness | Pearson scale: < 3 light-tailed, 3 normal, > 3 heavy-tailed (pandas reports excess kurtosis = Pearson kurtosis minus 3, so normal ≈ 0) | Shows how prone the data is to extreme values |
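
Kurtosis is reported on two different scales, which is a common source of confusion. As a rough sketch (the simulated sample and variable names are assumptions for illustration), the code below draws a large normal sample and compares pandas `Series.kurtosis()`, which returns *excess* kurtosis, with `scipy.stats.kurtosis` under both settings of its `fisher` parameter.

<!-- CELL BREAK -->

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated normal sample (illustrative only)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

excess_pandas = pd.Series(normal_sample).kurtosis()           # excess (Fisher) kurtosis
excess_scipy = stats.kurtosis(normal_sample, fisher=True)     # excess (Fisher) kurtosis
pearson_scipy = stats.kurtosis(normal_sample, fisher=False)   # Pearson kurtosis = excess + 3

print(f"pandas (excess):  {excess_pandas:.3f}")
print(f"SciPy (excess):   {excess_scipy:.3f}")
print(f"SciPy (Pearson):  {pearson_scipy:.3f}")
```

<!-- CELL BREAK -->

For a normal sample the excess values are close to 0 and the Pearson value is close to 3, so always check which scale a tool reports before comparing against a threshold.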

<!-- CELL BREAK -->

```python
# Distribution Shape Analysis
print("\n📊 DISTRIBUTION SHAPE ANALYSIS:")
print("="*50)

print("\n1️⃣ SKEWNESS ANALYSIS:")
print("   Purpose: Measures asymmetry of the distribution")
print("   Interpretation:")
print("     • Negative skew: Tail extends to the left (higher values more common)")
print("     • Zero skew: Symmetric distribution")  
print("     • Positive skew: Tail extends to the right (lower values more common)")

for col in numerical_columns:
    skewness = ecommerce_data[col].skew()
    if skewness < -0.5:
        interpretation = "Left-skewed (tail to left)"
    elif skewness > 0.5:
        interpretation = "Right-skewed (tail to right)"
    else:
        interpretation = "Approximately symmetric"
    
    print(f"   • {col}: {skewness:.2f} ({interpretation})")

print("\n2️⃣ KURTOSIS ANALYSIS:")
print("   Purpose: Measures peakedness and tail heaviness")
print("   Interpretation:")
print("     • High kurtosis: Sharp peak, heavy tails")
print("     • Low kurtosis: Flat peak, light tails")
print("     • pandas reports excess kurtosis: normal distribution ≈ 0 (Pearson kurtosis ≈ 3)")

for col in numerical_columns:
    kurt = ecommerce_data[col].kurtosis()  # Excess (Fisher) kurtosis, so compare against 0
    if kurt > 0:
        interpretation = "Heavy-tailed (more extreme values)"
    elif kurt < 0:
        interpretation = "Light-tailed (fewer extreme values)"
    else:
        interpretation = "Normal-like tails"
    
    print(f"   • {col}: {kurt:.2f} ({interpretation})")
```

<!-- CELL BREAK -->

### 1.6 Position Measures (Percentiles and Quartiles)

**What are Position Measures?**
Position measures help us understand where specific values fall within the distribution and are useful for identifying outliers and creating benchmarks.

#### 📊 Position Measures Comparison

| **Measure** | **Description** | **Business Use** | **Example Application** |
|-------------|-----------------|------------------|------------------------|
| **Quartiles** | Divide data into 4 equal parts | Performance ranking | Top 25% customers by sales |
| **Percentiles** | Divide data into 100 equal parts | Detailed positioning | 95th percentile response time |
| **Deciles** | Divide data into 10 equal parts | Performance segments | Top decile performers |
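
One common way to turn quartiles into an outlier check is the 1.5 × IQR rule (Tukey's fences). The sketch below applies it to a small set of invented transaction amounts; the values and the 1.5 multiplier are illustrative conventions rather than anything prescribed by this session.

<!-- CELL BREAK -->

```python
import pandas as pd

# Hypothetical transaction amounts (illustrative values only)
amounts = pd.Series([20, 22, 25, 27, 30, 31, 33, 35, 38, 400], dtype=float)

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1  # Interquartile range: spread of the middle 50%

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Flagged values: {outliers.tolist()}")
```

<!-- CELL BREAK -->

A flagged value is a candidate for investigation, not automatically an error: a very large order may be a genuine high-value customer.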

<!-- CELL BREAK -->

```python
# Position Measures Analysis
print("\n📍 POSITION MEASURES ANALYSIS:")
print("="*50)

print("\n1️⃣ QUARTILE ANALYSIS:")
print("   Purpose: Divides data into 4 equal parts")
print("   Q1 (25%): Upper bound of bottom quarter, Q2 (50%): Median, Q3 (75%): Lower bound of top quarter")

for col in numerical_columns:
    q1 = ecommerce_data[col].quantile(0.25)
    q2 = ecommerce_data[col].quantile(0.50)  # Same as median
    q3 = ecommerce_data[col].quantile(0.75)
    iqr = q3 - q1  # Interquartile Range
    
    print(f"   • {col}:")
    print(f"     Q1 (25%): {q1:.2f}")
    print(f"     Q2 (50%): {q2:.2f}")  
    print(f"     Q3 (75%): {q3:.2f}")
    print(f"     IQR: {iqr:.2f}")

print("\n2️⃣ PERCENTILE ANALYSIS:")
print("   Purpose: Shows value positions in the distribution")
print("   Useful for: Setting benchmarks, identifying top/bottom performers")

key_percentiles = [5, 10, 25, 50, 75, 90, 95]

for col in numerical_columns[:2]:  # Show for first 2 columns to save space
    print(f"   • {col}:")
    for p in key_percentiles:
        value = ecommerce_data[col].quantile(p/100)
        print(f"     {p}th percentile: {value:.2f}")

print("\n3️⃣ BUSINESS INSIGHTS FROM POSITION MEASURES:")
print("   Example: Customer Age Analysis")

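# Customer age insights (one row per customer, so repeat buyers are counted once)
customer_age = ecommerce_data.drop_duplicates(subset='customer_id')['age']
age_q1 = customer_age.quantile(0.25)
age_q3 = customer_age.quantile(0.75)
age_95th = customer_age.quantile(0.95)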

print(f"   • 25% of customers are younger than {age_q1:.0f} years")
print(f"   • 50% of customers are between {age_q1:.0f} and {age_q3:.0f} years")  
print(f"   • Top 5% oldest customers are {age_95th:.0f}+ years")

print("\n   Example: Sales Amount Analysis")
sales_q1 = ecommerce_data['total_amount'].quantile(0.25)
sales_q3 = ecommerce_data['total_amount'].quantile(0.75)
sales_90th = ecommerce_data['total_amount'].quantile(0.90)

print(f"   • 25% of transactions are under ${sales_q1:.2f}")
print(f"   • 50% of transactions are between ${sales_q1:.2f} and ${sales_q3:.2f}")
print(f"   • Top 10% highest transactions are ${sales_90th:.2f}+")
```

<!-- CELL BREAK -->

### 1.7 Complete Descriptive Statistics Summary

Now let's create a comprehensive summary using pandas built-in functions and interpret the results for business insights.
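
`describe()` does not report skewness or the coefficient of variation, so it can be handy to assemble a custom summary table alongside it. The following is one possible sketch on a tiny made-up DataFrame (the column values and the `summary` layout are assumptions for illustration).

<!-- CELL BREAK -->

```python
import pandas as pd

# Hypothetical mini-dataset (illustrative values only)
df = pd.DataFrame({
    'price': [10.0, 20.0, 30.0, 40.0, 100.0],
    'quantity': [1, 2, 2, 3, 2],
})

# Each expression returns one value per column; pandas aligns them by column name
summary = pd.DataFrame({
    'mean': df.mean(),
    'median': df.median(),
    'std': df.std(),
    'skew': df.skew(),
    'cv': df.std() / df.mean(),  # coefficient of variation
})

print(summary.round(3))
```

<!-- CELL BREAK -->

Each row of `summary` is one variable and each column is one statistic, which makes it easy to scan for variables whose mean and median disagree or whose relative spread is unusually high.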

<!-- CELL BREAK -->

```python
# Comprehensive Descriptive Statistics Summary
print("\n📋 COMPREHENSIVE DESCRIPTIVE STATISTICS SUMMARY:")
print("="*60)

print("\n1️⃣ PANDAS DESCRIBE() FUNCTION:")
print("   Purpose: Provides complete statistical summary")
print("   Includes: Count, Mean, Std, Min, 25%, 50%, 75%, Max")

numerical_summary = ecommerce_data[numerical_columns].describe()
print("\nNumerical Variables Summary:")
print(numerical_summary.round(2))

print("\n2️⃣ CATEGORICAL VARIABLES SUMMARY:")
print("   Purpose: Shows frequency distributions for categorical data")

categorical_summary = ecommerce_data[categorical_columns].describe()
print("\nCategorical Variables Summary:")
print(categorical_summary)

print("\n3️⃣ DETAILED FREQUENCY ANALYSIS:")
print("   Purpose: Understanding distribution of categorical variables")

for col in categorical_columns:
    print(f"\n   {col.upper()} DISTRIBUTION:")
    freq_counts = ecommerce_data[col].value_counts()
    freq_percentages = ecommerce_data[col].value_counts(normalize=True) * 100
    
    for category in freq_counts.index:
        count = freq_counts[category]
        percentage = freq_percentages[category]
        print(f"     • {category}: {count:,} ({percentage:.1f}%)")

print("\n4️⃣ BUSINESS INSIGHTS FROM DESCRIPTIVE STATISTICS:")

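# Age insights (one row per customer, so repeat buyers are counted once)
customer_age = ecommerce_data.drop_duplicates(subset='customer_id')['age']
avg_age = customer_age.mean()
age_std = customer_age.std()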
print(f"\n   👥 CUSTOMER DEMOGRAPHICS:")
print(f"   • Average customer age: {avg_age:.1f} years")
print(f"   • Age diversity (std dev): {age_std:.1f} years")
print(f"   • Most customers are between {avg_age-age_std:.0f}-{avg_age+age_std:.0f} years")

# Sales insights  
avg_transaction = ecommerce_data['total_amount'].mean()
median_transaction = ecommerce_data['total_amount'].median()
print(f"\n   💰 SALES PATTERNS:")
print(f"   • Average transaction: ${avg_transaction:.2f}")
print(f"   • Median transaction: ${median_transaction:.2f}")
if avg_transaction > median_transaction:
    print(f"   • Distribution is right-skewed (few high-value transactions)")
else:
    print(f"   • Distribution is balanced or left-skewed")

# Product insights
avg_price = ecommerce_data['price'].mean()
price_cv = (ecommerce_data['price'].std() / avg_price) * 100
print(f"\n   🛍️ PRODUCT PRICING:")
print(f"   • Average product price: ${avg_price:.2f}")
print(f"   • Price variability: {price_cv:.1f}% (Coefficient of Variation)")
if price_cv > 50:
    print(f"   • High price diversity across products")
else:
    print(f"   • Moderate price consistency across products")
```

<!-- CELL BREAK -->

---
## 🔗 Part 2: Correlation Analysis and Relationships

<!-- CELL BREAK -->

### 2.1 Understanding Correlation

**What is Correlation?**
Correlation measures the strength and direction of the linear relationship between two variables. It helps us understand how variables move together.

**Important Note:** Correlation does NOT imply causation! Just because two variables are correlated doesn't mean one causes the other.
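
A simulated example can make this warning concrete. In the sketch below, a hidden third variable (temperature) drives two otherwise unrelated sales figures; the scenario, coefficients, and the `residuals` helper are all hypothetical. The raw correlation is strong, but it largely disappears once the shared driver is removed.

<!-- CELL BREAK -->

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical scenario: temperature (the confounder) drives both sales series
temperature = rng.normal(30, 5, n)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 5, n)
sunscreen_sales = 1.5 * temperature + rng.normal(0, 5, n)

raw_corr = np.corrcoef(ice_cream_sales, sunscreen_sales)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y using a least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation between the two series after removing the effect of temperature
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunscreen_sales, temperature),
)[0, 1]

print(f"Raw correlation:                   {raw_corr:.3f}")
print(f"After controlling for temperature: {partial_corr:.3f}")
```

<!-- CELL BREAK -->

Neither product causes sales of the other; both respond to the same underlying driver. Before acting on a correlation, ask which other variables could be moving both measures.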

#### 📊 Correlation Methods Comparison

| **Method** | **Type of Data** | **Range** | **When to Use** | **Interpretation** |
|------------|------------------|-----------|-----------------|-------------------|
| **Pearson** | Continuous, linear relationships | -1 to +1 | Normal distributions, linear relationships | Most common correlation measure |
| **Spearman** | Ordinal or continuous, monotonic relationships | -1 to +1 | Non-normal distributions, monotonic relationships | Rank-based correlation |
| **Kendall's Tau** | Ordinal, small samples | -1 to +1 | Small datasets, many tied ranks | Rank-based; compares concordant and discordant pairs |
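
To see how the three methods differ in practice, the sketch below applies each one to a relationship that is perfectly monotonic but not linear. The exponential data is invented for illustration; the calls use the `method` argument of pandas `Series.corr` (the Spearman and Kendall options rely on SciPy being installed).

<!-- CELL BREAK -->

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows exponentially with x (monotonic, but not linear)
x = pd.Series(np.linspace(0, 5, 50))
y = np.exp(x)

pearson_r = x.corr(y, method='pearson')
spearman_r = x.corr(y, method='spearman')
kendall_tau = x.corr(y, method='kendall')

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
print(f"Kendall:  {kendall_tau:.3f}")
```

<!-- CELL BREAK -->

The rank-based measures report a perfect relationship because y always increases with x, while Pearson is lower because the points do not lie on a straight line.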

#### 🎯 Correlation Strength Interpretation

| **Absolute Correlation Value** | **Strength** | **Business Meaning** |
|----------------------|--------------|---------------------|
| 0.0 to 0.3 | Weak | Little relationship |
| 0.3 to 0.7 | Moderate | Some relationship worth investigating |
| 0.7 to 1.0 | Strong | Strong relationship, important for business |

<!-- CELL BREAK -->

```python
# Correlation Analysis
print("🔗 CORRELATION ANALYSIS:")
print("="*50)

print("\n1️⃣ PEARSON CORRELATION ANALYSIS:")
print("   Purpose: Measures linear relationships between numerical variables")
print("   Range: -1 (perfect negative) to +1 (perfect positive)")
print("   Requirements: Numerical data, linear relationships")

# Calculate Pearson correlation matrix
correlation_matrix = ecommerce_data[numerical_columns].corr()

print("\nPearson Correlation Matrix:")
print(correlation_matrix.round(3))

print("\n2️⃣ INTERPRETING CORRELATION VALUES:")
print("   Strong correlations (|r| > 0.7):")

# Find strong correlations
strong_correlations = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.7:
            var1 = correlation_matrix.columns[i]
            var2 = correlation_matrix.columns[j]
            strong_correlations.append((var1, var2, corr_value))

if strong_correlations:
    for var1, var2, corr in strong_correlations:
        direction = "Positive" if corr > 0 else "Negative"
        print(f"   • {var1} ↔ {var2}: {corr:.3f} ({direction})")
else:
    print("   • No strong correlations found")

print("\n   Moderate correlations (0.3 < |r| < 0.7):")
moderate_correlations = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if 0.3 < abs(corr_value) < 0.7:
            var1 = correlation_matrix.columns[i]
            var2 = correlation_matrix.columns[j]
            moderate_correlations.append((var1, var2, corr_value))

if moderate_correlations:
    for var1, var2, corr in moderate_correlations:
        direction = "Positive" if corr > 0 else "Negative"
        print(f"   • {var1} ↔ {var2}: {corr:.3f} ({direction})")
else:
    print("   • No moderate correlations found")
```

<!-- CELL BREAK -->

```python
# Spearman Correlation (Rank-based)
print("\n3️⃣ SPEARMAN CORRELATION ANALYSIS:")
print("   Purpose: Measures monotonic relationships (not necessarily linear)")
print("   Advantage: Works with non-normal data and non-linear relationships")
print("   Method: Based on rank ordering rather than actual values")

# Calculate Spearman correlation
spearman_corr = ecommerce_data[numerical_columns].corr(method='spearman')

print("\nSpearman Correlation Matrix:")
print(spearman_corr.round(3))

print("\n4️⃣ COMPARING PEARSON vs SPEARMAN:")
print("   Purpose: Identify non-linear relationships")

for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        pearson_val = correlation_matrix.iloc[i, j]
        spearman_val = spearman_corr.iloc[i, j]
        
        if abs(pearson_val - spearman_val) > 0.1:  # Noticeable gap (rule of thumb, not a significance test)
            var1 = correlation_matrix.columns[i]
            var2 = correlation_matrix.columns[j]
            print(f"   • {var1} ↔ {var2}:")
            print(f"     Pearson: {pearson_val:.3f}, Spearman: {spearman_val:.3f}")
            print(f"     → May indicate a non-linear (monotonic) relationship or influential outliers")

print("\n5️⃣ BUSINESS INSIGHTS FROM CORRELATIONS:")

# Age and spending correlation
age_spending_corr = ecommerce_data['age'].corr(ecommerce_data['total_amount'])
print(f"\n   👥 Age vs Spending: {age_spending_corr:.3f}")
if abs(age_spending_corr) > 0.3:
    direction = "increases" if age_spending_corr > 0 else "decreases"
    print(f"   → Customer spending {direction} with age")
else:
    print(f"   → Age has little relationship with spending amount")

# Price and quantity correlation  
price_quantity_corr = ecommerce_data['price'].corr(ecommerce_data['quantity'])
print(f"\n   💰 Price vs Quantity: {price_quantity_corr:.3f}")
if abs(price_quantity_corr) > 0.3:
    direction = "increases" if price_quantity_corr > 0 else "decreases"
    print(f"   → Quantity purchased {direction} with product price")
else:
    print(f"   → Product price has little linear relationship with quantity purchased")
```

<!-- CELL BREAK -->

### 2.2 Correlation Visualization

Visual correlation analysis helps identify patterns that might not be obvious from numbers alone.

#### 📊 Correlation Visualization Methods

| **Method** | **Best For** | **Advantages** | **When to Use** |
|------------|--------------|----------------|-----------------|
| **Heatmap** | Overview of all correlations | Shows patterns across many variables | Initial correlation exploration |
| **Scatter Plot** | Individual variable relationships | Shows relationship shape and outliers | Detailed analysis of specific pairs |
| **Pair Plot** | Multiple variable relationships | Shows distributions + correlations | Comprehensive relationship analysis |

<!-- CELL BREAK -->

```python
# Correlation Visualization
print("\n📊 CORRELATION VISUALIZATION:")
print("="*50)

print("\n1️⃣ CORRELATION HEATMAP:")
print("   Purpose: Visual overview of all correlations")
print("   Benefits: Quick identification of strong relationships")

# Create correlation heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Mask upper triangle
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, fmt='.3f', cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Heatmap - Numerical Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("   ✅ Heatmap created - Look for dark red (positive) and dark blue (negative) cells")

print("\n2️⃣ SCATTER PLOT ANALYSIS:")
print("   Purpose: Detailed view of individual relationships")
print("   Benefits: Shows relationship shape, outliers, and data distribution")

# Create scatter plots for interesting relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Scatter Plot Analysis - Key Relationships', fontsize=16, fontweight='bold')

# Plot 1: Age vs Total Amount
axes[0, 0].scatter(ecommerce_data['age'], ecommerce_data['total_amount'], alpha=0.6, color='blue')
axes[0, 0].set_xlabel('Customer Age')
axes[0, 0].set_ylabel('Total Amount ($)')
axes[0, 0].set_title('Age vs Total Amount')
axes[0, 0].grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(ecommerce_data['age'], ecommerce_data['total_amount'], 1)
p = np.poly1d(z)
axes[0, 0].plot(ecommerce_data['age'], p(ecommerce_data['age']), "r--", alpha=0.8)

# Plot 2: Price vs Quantity
axes[0, 1].scatter(ecommerce_data['price'], ecommerce_data['quantity'], alpha=0.6, color='green')
axes[0, 1].set_xlabel('Product Price ($)')
axes[0, 1].set_ylabel('Quantity Purchased')
axes[0, 1].set_title('Price vs Quantity')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Price vs Total Amount
axes[1, 0].scatter(ecommerce_data['price'], ecommerce_data['total_amount'], alpha=0.6, color='orange')
axes[1, 0].set_xlabel('Product Price ($)')
axes[1, 0].set_ylabel('Total Amount ($)')
axes[1, 0].set_title('Price vs Total Amount')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Quantity vs Total Amount
axes[1, 1].scatter(ecommerce_data['quantity'], ecommerce_data['total_amount'], alpha=0.6, color='purple')
axes[1, 1].set_xlabel('Quantity Purchased')
axes[1, 1].set_ylabel('Total Amount ($)')
axes[1, 1].set_title('Quantity vs Total Amount')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("   ✅ Scatter plots created - Look for linear patterns, clusters, and outliers")
```

<!-- CELL BREAK -->

### 2.3 Categorical Variable Relationships

Understanding relationships between categorical variables requires different methods than numerical correlations.

#### 📊 Categorical Relationship Methods

| **Method** | **Purpose** | **When to Use** | **Output** |
|------------|-------------|-----------------|------------|
| **Cross-tabulation** | Frequency relationships | Two categorical variables | Counts and percentages |
| **Chi-square test** | Statistical significance | Test independence | P-value and test statistic |
| **Cramér's V** | Association strength | Strength of relationship | 0 to 1 (stronger = higher) |
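
Cramér's V can be derived from the chi-square statistic as V = sqrt(chi² / (n × (min(rows, columns) - 1))). The helper below is a minimal sketch of that formula without bias correction; the function name `cramers_v` and the two small count tables are hypothetical, and Yates' continuity correction is switched off so the 2×2 examples match the formula exactly.

<!-- CELL BREAK -->

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(contingency_table: pd.DataFrame) -> float:
    """Cramér's V from a table of counts (uncorrected version of the statistic)."""
    chi2, _, _, _ = stats.chi2_contingency(contingency_table, correction=False)
    n = contingency_table.to_numpy().sum()        # total number of observations
    min_dim = min(contingency_table.shape) - 1    # min(rows, columns) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical counts (illustrative only)
independent = pd.DataFrame([[50, 50], [50, 50]],
                           index=['Riyadh', 'Jeddah'], columns=['Premium', 'Basic'])
dependent = pd.DataFrame([[90, 10], [10, 90]],
                         index=['Riyadh', 'Jeddah'], columns=['Premium', 'Basic'])

v_independent = cramers_v(independent)
v_dependent = cramers_v(dependent)

print(f"Cramér's V (no association):     {v_independent:.2f}")
print(f"Cramér's V (strong association): {v_dependent:.2f}")
```

<!-- CELL BREAK -->

Unlike the chi-square p-value, which shrinks as the sample grows, Cramér's V describes the strength of the association, so the two are best reported together.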

<!-- CELL BREAK -->

```python
# Categorical Variable Relationships
print("\n🏷️ CATEGORICAL VARIABLE RELATIONSHIPS:")
print("="*50)

print("\n1️⃣ CROSS-TABULATION ANALYSIS:")
print("   Purpose: Shows frequency distribution between categorical variables")
print("   Benefits: Understand how categories relate to each other")

# Cross-tabulation: City vs Segment
print("\n   CUSTOMER CITY vs SEGMENT:")
city_segment_crosstab = pd.crosstab(ecommerce_data['city'], ecommerce_data['segment'])
print(city_segment_crosstab)

# Add percentages
print("\n   Percentage distribution (by city):")
city_segment_percent = pd.crosstab(ecommerce_data['city'], ecommerce_data['segment'], normalize='index') * 100
print(city_segment_percent.round(1))

print("\n2️⃣ CATEGORY vs NUMERICAL ANALYSIS:")
print("   Purpose: How categorical variables relate to numerical outcomes")

# Segment vs Average spending
print("\n   CUSTOMER SEGMENT vs AVERAGE SPENDING:")
segment_spending = ecommerce_data.groupby('segment')['total_amount'].agg(['mean', 'median', 'count'])
segment_spending.columns = ['Average_Spending', 'Median_Spending', 'Number_of_Transactions']
print(segment_spending.round(2))

# City vs Average spending
print("\n   CUSTOMER CITY vs AVERAGE SPENDING:")
city_spending = ecommerce_data.groupby('city')['total_amount'].agg(['mean', 'median', 'count'])
city_spending.columns = ['Average_Spending', 'Median_Spending', 'Number_of_Transactions']
print(city_spending.round(2))

# Product category vs metrics
print("\n   PRODUCT CATEGORY vs METRICS:")
category_metrics = ecommerce_data.groupby('category').agg({
    'total_amount': ['mean', 'median'],
    'quantity': 'mean',
    'price': 'mean'
}).round(2)
print(category_metrics)

print("\n3️⃣ CHI-SQUARE TEST OF INDEPENDENCE:")
print("   Purpose: Test if two categorical variables are independent")
print("   Null hypothesis: Variables are independent (not related)")
print("   Alternative: Variables are dependent (related)")

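# Chi-square test for city vs segment
# Use one row per customer: repeated transactions by the same customer are not independent observations
customer_level = ecommerce_data.drop_duplicates(subset='customer_id')
customer_crosstab = pd.crosstab(customer_level['city'], customer_level['segment'])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(customer_crosstab)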

print(f"\n   CITY vs SEGMENT Chi-square test:")
print(f"   • Chi-square statistic: {chi2_stat:.3f}")
print(f"   • P-value: {p_value:.3f}")
print(f"   • Degrees of freedom: {dof}")

if p_value < 0.05:
    print(f"   → Significant relationship (p < 0.05): City and segment are related")
else:
    print(f"   → No significant relationship (p >= 0.05): City and segment appear independent")

print("\n4️⃣ BUSINESS INSIGHTS FROM CATEGORICAL RELATIONSHIPS:")

# Find highest spending segment
highest_segment = segment_spending['Average_Spending'].idxmax()
highest_amount = segment_spending['Average_Spending'].max()
print(f"\n   💎 Highest value segment: {highest_segment} (${highest_amount:.2f} average)")

# Find best performing city
best_city = city_spending['Average_Spending'].idxmax()
best_city_amount = city_spending['Average_Spending'].max()
print(f"   🏙️ Best performing city: {best_city} (${best_city_amount:.2f} average)")

# Find most popular category
category_popularity = ecommerce_data['category'].value_counts()
most_popular = category_popularity.index[0]
most_popular_count = category_popularity.iloc[0]
print(f"   🛍️ Most popular category: {most_popular} ({most_popular_count:,} transactions)")
```
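
<!-- CELL BREAK -->

A significant chi-square result says that an association exists, not how strong it is. The sketch below uses a small made-up city/segment table (not the course dataset) to show two common follow-up checks: the smallest expected cell count, since the chi-square approximation is usually considered unreliable when expected counts fall below about 5, and Cramér's V as an effect size between 0 and 1. The toy data and the variable names are assumptions for illustration only.

<!-- CELL BREAK -->

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for the 'city' and 'segment' columns
toy = pd.DataFrame({
    "city": ["Riyadh"] * 60 + ["Jeddah"] * 40,
    "segment": ["Premium"] * 30 + ["Basic"] * 30 + ["Premium"] * 10 + ["Basic"] * 30,
})
table = pd.crosstab(toy["city"], toy["segment"])

# correction=False keeps the plain Pearson statistic, which Cramér's V is defined on
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)

# Rule of thumb: be cautious when any expected count is below 5
min_expected = expected.min()

# Cramér's V: 0 = no association, 1 = perfect association
n = table.to_numpy().sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
print(f"Smallest expected count: {min_expected:.1f}")
print(f"Cramér's V: {cramers_v:.3f}")
```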

<!-- CELL BREAK -->

---
## 📈 Part 3: Distribution Analysis and Statistical Testing

<!-- CELL BREAK -->

### 3.1 Understanding Data Distributions

**What is a Distribution?**
A distribution shows how values of a variable are spread across different ranges. Understanding distributions helps us choose appropriate statistical methods and identify patterns.

#### 📊 Common Distribution Types

| **Distribution** | **Shape** | **Characteristics** | **Examples** | **Statistical Tests** |
|------------------|-----------|--------------------|--------------|--------------------|
| **Normal** | Bell-shaped, symmetric | Mean = Median = Mode | Heights, test scores | t-test, ANOVA |
| **Right-skewed** | Tail extends right | Mean > Median | Income, sales amounts | Non-parametric tests |
| **Left-skewed** | Tail extends left | Mean < Median | Age at retirement | Non-parametric tests |
| **Uniform** | Flat, all values equally likely | No single peak (no mode) | Random numbers | Chi-square goodness-of-fit test |
| **Bimodal** | Two peaks | Two common value ranges | Mixed populations | Mixture analysis |
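
The mean/median relationships in the table can be checked on synthetic data. This is a minimal sketch using randomly generated samples (not the course dataset); the seed, the sample sizes, and the distribution parameters are arbitrary assumptions chosen only to produce the three shapes.

<!-- CELL BREAK -->

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Synthetic samples, one per shape
samples = {
    "normal": rng.normal(loc=50, scale=10, size=5000),
    "right_skewed": rng.lognormal(mean=3, sigma=0.8, size=5000),
    "left_skewed": 100 - rng.lognormal(mean=3, sigma=0.8, size=5000),
}

# Mean, median, and skewness for each sample
summary = pd.DataFrame({
    name: {
        "mean": values.mean(),
        "median": np.median(values),
        "skewness": pd.Series(values).skew(),
    }
    for name, values in samples.items()
}).T

print(summary.round(2))
```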

<!-- CELL BREAK -->

```python
# Distribution Analysis
print("📈 DISTRIBUTION ANALYSIS:")
print("="*50)

print("\n1️⃣ VISUAL DISTRIBUTION ANALYSIS:")
print("   Purpose: Understand the shape and characteristics of data distributions")

# Create distribution plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Distribution Analysis - Key Variables', fontsize=16, fontweight='bold')

# Plot 1: Age Distribution
axes[0, 0].hist(ecommerce_data['age'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].axvline(ecommerce_data['age'].mean(), color='red', linestyle='--', label=f'Mean: {ecommerce_data["age"].mean():.1f}')
axes[0, 0].axvline(ecommerce_data['age'].median(), color='green', linestyle='--', label=f'Median: {ecommerce_data["age"].median():.1f}')
axes[0, 0].set_xlabel('Customer Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Total Amount Distribution
axes[0, 1].hist(ecommerce_data['total_amount'], bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
axes[0, 1].axvline(ecommerce_data['total_amount'].mean(), color='red', linestyle='--', label=f'Mean: ${ecommerce_data["total_amount"].mean():.0f}')
axes[0, 1].axvline(ecommerce_data['total_amount'].median(), color='green', linestyle='--', label=f'Median: ${ecommerce_data["total_amount"].median():.0f}')
axes[0, 1].set_xlabel('Total Amount ($)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Total Amount Distribution')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Price Distribution
axes[1, 0].hist(ecommerce_data['price'], bins=30, alpha=0.7, color='lightcoral', edgecolor='black')
axes[1, 0].axvline(ecommerce_data['price'].mean(), color='red', linestyle='--', label=f'Mean: ${ecommerce_data["price"].mean():.0f}')
axes[1, 0].axvline(ecommerce_data['price'].median(), color='green', linestyle='--', label=f'Median: ${ecommerce_data["price"].median():.0f}')
axes[1, 0].set_xlabel('Product Price ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Product Price Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Quantity Distribution
axes[1, 1].hist(ecommerce_data['quantity'], bins=range(1, ecommerce_data['quantity'].max()+2), 
                alpha=0.7, color='lightyellow', edgecolor='black')
axes[1, 1].axvline(ecommerce_data['quantity'].mean(), color='red', linestyle='--', label=f'Mean: {ecommerce_data["quantity"].mean():.1f}')
axes[1, 1].axvline(ecommerce_data['quantity'].median(), color='green', linestyle='--', label=f'Median: {ecommerce_data["quantity"].median():.1f}')
axes[1, 1].set_xlabel('Quantity Purchased')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Quantity Distribution')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("   ✅ Distribution plots created")

print("\n2️⃣ DISTRIBUTION SHAPE ANALYSIS:")
print("   Purpose: Classify distribution shapes for appropriate analysis methods")

distributions = ['age', 'total_amount', 'price', 'quantity']

for var in distributions:
    mean_val = ecommerce_data[var].mean()
    median_val = ecommerce_data[var].median()
    skewness = ecommerce_data[var].skew()
    
    # Determine distribution shape (low skewness indicates symmetry, not normality itself)
    if abs(skewness) < 0.5:
        shape = "Approximately symmetric"
    elif skewness > 0:
        shape = "Right-skewed (Positive skew)"
    else:
        shape = "Left-skewed (Negative skew)"
    
    # Compare mean and median
    if abs(mean_val - median_val) / median_val < 0.1:
        central_tendency = "Mean ≈ Median (symmetric)"
    elif mean_val > median_val:
        central_tendency = "Mean > Median (right tail)"
    else:
        central_tendency = "Mean < Median (left tail)"
    
    print(f"\n   📊 {var.upper()}:")
    print(f"      Shape: {shape}")
    print(f"      Central tendency: {central_tendency}")
    print(f"      Skewness: {skewness:.3f}")
```

<!-- CELL BREAK -->

### 3.2 Normality Testing

Many common methods (t-tests, ANOVA, Pearson correlation) assume approximately normal data, so checking normality guides the choice between parametric and non-parametric methods.

#### 📊 Normality Tests Comparison

| **Test** | **Best For** | **Sample Size** | **Null Hypothesis** | **Decision Rule** |
|----------|--------------|-----------------|-------------------|-------------------|
| **Shapiro-Wilk** | Small to medium samples | < 5000 | Data is normally distributed | p > 0.05: fail to reject normality |
| **Anderson-Darling** | Any sample size | Any | Data is normally distributed | Statistic < critical value: fail to reject normality |
| **Kolmogorov-Smirnov** | Large samples | > 2000 | Data is normally distributed | p > 0.05: fail to reject normality |
| **D'Agostino** | Large samples | > 20 | Data is normally distributed | p > 0.05: fail to reject normality |
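
The D'Agostino and Kolmogorov-Smirnov tests from the table can be run with `scipy.stats.normaltest` and `scipy.stats.kstest`. The sketch below applies both to two synthetic samples (assumed data, not the course dataset). Note one caveat: standardizing with the sample's own mean and standard deviation before the KS test makes its p-value only approximate (it tends to be too large), so treat it as a rough check.

<!-- CELL BREAK -->

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples: one bell-shaped, one strongly right-skewed
samples = {
    "normal": rng.normal(loc=35, scale=8, size=2000),
    "skewed": rng.exponential(scale=100, size=2000),
}

results = {}
for name, x in samples.items():
    # D'Agostino-Pearson K^2 test: combines skewness and kurtosis
    k2_stat, k2_p = stats.normaltest(x)

    # KS test against a standard normal after standardizing the sample
    z = (x - x.mean()) / x.std(ddof=1)
    ks_stat, ks_p = stats.kstest(z, "norm")

    results[name] = {"k2_p": k2_p, "ks_p": ks_p}
    print(f"{name:>7}: D'Agostino p = {k2_p:.4f} | KS p = {ks_p:.4f}")
```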

<!-- CELL BREAK -->

```python
# Normality Testing
print("\n🔍 NORMALITY TESTING:")
print("="*50)

print("\n1️⃣ SHAPIRO-WILK TEST:")
print("   Purpose: Test if data comes from a normal distribution")
print("   Best for: Small to medium samples (< 5000)")
print("   Null hypothesis: Data is normally distributed")
print("   Decision: p > 0.05 suggests normal distribution")

for var in distributions:
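    # Take a reproducible sample for Shapiro-Wilk (works best with smaller samples).
    # Size the sample from the non-null values so sample() never requests more rows than exist.
    clean_values = ecommerce_data[var].dropna()
    sample_data = clean_values.sample(min(1000, len(clean_values)), random_state=42)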
    
    stat, p_value = stats.shapiro(sample_data)
    
    interpretation = "Normally distributed" if p_value > 0.05 else "Not normally distributed"
    
    print(f"\n   {var.upper()}:")
    print(f"   • Statistic: {stat:.6f}")
    print(f"   • P-value: {p_value:.6f}")
    print(f"   • Result: {interpretation}")

print("\n2️⃣ ANDERSON-DARLING TEST:")
print("   Purpose: More powerful test for normality")
print("   Advantage: Works well with any sample size")

for var in distributions:
    result = stats.anderson(ecommerce_data[var].dropna(), dist='norm')
    
    print(f"\n   {var.upper()}:")
    print(f"   • Statistic: {result.statistic:.6f}")
    print(f"   • Critical values: {result.critical_values}")
    print(f"   • Significance levels: {result.significance_level}%")
    
    # Check against 5% significance level
    critical_5 = result.critical_values[2]  # 5% level is usually index 2
    if result.statistic < critical_5:
        print(f"   • Result: Normally distributed (at 5% level)")
    else:
        print(f"   • Result: Not normally distributed (at 5% level)")

print("\n3️⃣ Q-Q PLOT ANALYSIS:")
print("   Purpose: Visual assessment of normality")
print("   Interpretation: Points on straight line = normal distribution")

# Create Q-Q plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Q-Q Plots for Normality Assessment', fontsize=16, fontweight='bold')

variables_for_qq = ['age', 'total_amount', 'price', 'quantity']
plot_positions = [(0,0), (0,1), (1,0), (1,1)]

for i, var in enumerate(variables_for_qq):
    row, col = plot_positions[i]
    stats.probplot(ecommerce_data[var].dropna(), dist="norm", plot=axes[row, col])
    axes[row, col].set_title(f'Q-Q Plot: {var.title()}')
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("   ✅ Q-Q plots created - Straight line indicates normal distribution")

print("\n4️⃣ BUSINESS IMPLICATIONS OF NORMALITY:")

print("\n   📊 STATISTICAL METHOD RECOMMENDATIONS:")
normal_vars = []
non_normal_vars = []

for var in distributions:
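    # Size the sample from the non-null values and fix the seed for repeatable results
    clean_values = ecommerce_data[var].dropna()
    sample_data = clean_values.sample(min(1000, len(clean_values)), random_state=42)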
    _, p_value = stats.shapiro(sample_data)
    
    if p_value > 0.05:
        normal_vars.append(var)
    else:
        non_normal_vars.append(var)

if normal_vars:
    print(f"\n   ✅ Normal distributions: {', '.join(normal_vars)}")
    print(f"      → Use: t-tests, ANOVA, Pearson correlation")

if non_normal_vars:
    print(f"\n   ⚠️ Non-normal distributions: {', '.join(non_normal_vars)}")
    print(f"      → Use: Mann-Whitney U, Kruskal-Wallis, Spearman correlation")
```
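
<!-- CELL BREAK -->

The non-parametric alternatives recommended above (Kruskal-Wallis and Spearman correlation) are available in `scipy.stats`. The following sketch runs both on synthetic, right-skewed data; the group definitions and parameters are assumptions made up for illustration, not values from the course dataset.

<!-- CELL BREAK -->

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic right-skewed "spending" for three hypothetical groups;
# the third group is shifted upward
group_1 = rng.lognormal(mean=4.0, sigma=0.6, size=300)
group_2 = rng.lognormal(mean=4.0, sigma=0.6, size=300)
group_3 = rng.lognormal(mean=4.6, sigma=0.6, size=300)

# Kruskal-Wallis: rank-based alternative to one-way ANOVA
h_stat, kw_p = stats.kruskal(group_1, group_2, group_3)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {kw_p:.6f}")

# Spearman measures monotonic association, Pearson measures linear association
x = rng.uniform(1, 10, size=300)
y = np.exp(x)  # perfectly monotonic, but far from linear
rho, spearman_p = stats.spearmanr(x, y)
r, pearson_p = stats.pearsonr(x, y)
print(f"Spearman rho = {rho:.3f} | Pearson r = {r:.3f}")
```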

<!-- CELL BREAK -->

### 3.3 Statistical Hypothesis Testing

Hypothesis testing helps us make data-driven decisions by testing specific claims about our data.

#### 📊 Common Statistical Tests Comparison

| **Test** | **Purpose** | **Data Requirements** | **Example Question** |
|----------|-------------|----------------------|---------------------|
| **One-sample t-test** | Compare mean to target value | Normal distribution, continuous | Is average age = 35? |
| **Two-sample t-test** | Compare means of two groups | Normal distribution, continuous | Do men spend more than women? |
| **Mann-Whitney U** | Compare two groups (non-parametric) | Ordinal data, any distribution | Do segments differ in spending? |
| **Chi-square test** | Test categorical associations | Categorical data, adequate sample | Are city and segment related? |
| **ANOVA** | Compare multiple group means | Normal distribution, continuous | Do all cities have same average spending? |

<!-- CELL BREAK -->

```python
# Statistical Hypothesis Testing
print("\n🧪 STATISTICAL HYPOTHESIS TESTING:")
print("="*50)

print("\n1️⃣ ONE-SAMPLE T-TEST:")
print("   Purpose: Test if a sample mean equals a specific value")
print("   Example: Is the average customer age significantly different from 35?")

# Test if average age is significantly different from 35
age_data = ecommerce_data['age'].dropna()
target_age = 35

t_stat, p_value = stats.ttest_1samp(age_data, target_age)

print(f"\n   HYPOTHESIS TEST:")
print(f"   • Null hypothesis: Average age = {target_age}")
print(f"   • Alternative hypothesis: Average age ≠ {target_age}")
print(f"   • Sample mean: {age_data.mean():.2f}")
print(f"   • T-statistic: {t_stat:.4f}")
print(f"   • P-value: {p_value:.6f}")

if p_value < 0.05:
    direction = "higher" if age_data.mean() > target_age else "lower"
    print(f"   • Result: Significant difference - Average age is significantly {direction} than {target_age}")
else:
    print(f"   • Result: No significant difference - Average age is not significantly different from {target_age}")

print("\n2️⃣ TWO-SAMPLE T-TEST:")
print("   Purpose: Compare means between two groups")
print("   Example: Do Premium and Basic customers spend differently?")

# Compare spending between Premium and Basic customers
premium_spending = ecommerce_data[ecommerce_data['segment'] == 'Premium']['total_amount']
basic_spending = ecommerce_data[ecommerce_data['segment'] == 'Basic']['total_amount']

t_stat, p_value = stats.ttest_ind(premium_spending.dropna(), basic_spending.dropna())

print(f"\n   GROUP COMPARISON:")
print(f"   • Premium customers mean: ${premium_spending.mean():.2f}")
print(f"   • Basic customers mean: ${basic_spending.mean():.2f}")
print(f"   • Difference: ${premium_spending.mean() - basic_spending.mean():.2f}")
print(f"   • T-statistic: {t_stat:.4f}")
print(f"   • P-value: {p_value:.6f}")

if p_value < 0.05:
    higher_group = "Premium" if premium_spending.mean() > basic_spending.mean() else "Basic"
    print(f"   • Result: Significant difference - {higher_group} customers spend significantly more")
else:
    print(f"   • Result: No significant difference in spending between Premium and Basic customers")

print("\n3️⃣ MANN-WHITNEY U TEST (NON-PARAMETRIC):")
print("   Purpose: Compare two groups when data is not normally distributed")
print("   Example: Compare spending between two cities")

riyadh_spending = ecommerce_data[ecommerce_data['city'] == 'Riyadh']['total_amount']
jeddah_spending = ecommerce_data[ecommerce_data['city'] == 'Jeddah']['total_amount']

u_stat, p_value = stats.mannwhitneyu(riyadh_spending.dropna(), jeddah_spending.dropna(), alternative='two-sided')

print(f"\n   CITY COMPARISON (NON-PARAMETRIC):")
print(f"   • Riyadh median: ${riyadh_spending.median():.2f}")
print(f"   • Jeddah median: ${jeddah_spending.median():.2f}")
print(f"   • U-statistic: {u_stat:.4f}")
print(f"   • P-value: {p_value:.6f}")

if p_value < 0.05:
    higher_city = "Riyadh" if riyadh_spending.median() > jeddah_spending.median() else "Jeddah"
    print(f"   • Result: Significant difference - {higher_city} has significantly higher spending")
else:
    print(f"   • Result: No significant difference in spending between Riyadh and Jeddah")

print("\n4️⃣ ANOVA (ANALYSIS OF VARIANCE):")
print("   Purpose: Compare means across multiple groups")
print("   Example: Do all product categories have the same average price?")

# Get price data for each category
category_groups = []
category_names = []
for category in ecommerce_data['category'].unique():
    category_prices = ecommerce_data[ecommerce_data['category'] == category]['price']
    category_groups.append(category_prices.dropna())
    category_names.append(category)

f_stat, p_value = stats.f_oneway(*category_groups)

print(f"\n   MULTIPLE GROUP COMPARISON:")
for i, category in enumerate(category_names):
    print(f"   • {category} mean price: ${category_groups[i].mean():.2f}")

print(f"\n   • F-statistic: {f_stat:.4f}")
print(f"   • P-value: {p_value:.6f}")

if p_value < 0.05:
    print(f"   • Result: Significant difference - Product categories have significantly different average prices")
else:
    print(f"   • Result: No significant difference - All product categories have similar average prices")

print("\n5️⃣ BUSINESS INSIGHTS FROM STATISTICAL TESTS:")

insights = []

# Age insight (based on practical size of the gap, not on a p-value)
if abs(age_data.mean() - target_age) > 2:  # Practical significance: more than 2 years
    age_direction = "older" if age_data.mean() > target_age else "younger"
    insights.append(f"Customer base is notably {age_direction} than expected (avg: {age_data.mean():.1f} vs target: {target_age})")

# Segment insight
if premium_spending.mean() > basic_spending.mean() * 1.2:  # 20% higher
    insights.append(f"Premium customers spend {((premium_spending.mean()/basic_spending.mean()-1)*100):.0f}% more than Basic customers")

# Category insight
price_ranges = [group.mean() for group in category_groups]
if max(price_ranges) > min(price_ranges) * 2:  # 2x difference
    highest_cat = category_names[np.argmax(price_ranges)]
    lowest_cat = category_names[np.argmin(price_ranges)]
    insights.append(f"{highest_cat} products are much more expensive than {lowest_cat} products")

print(f"\n   💡 KEY BUSINESS INSIGHTS:")
for insight in insights:
    print(f"   • {insight}")
```
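
<!-- CELL BREAK -->

With large samples, even small differences produce small p-values, so it helps to report an effect size next to the test result. The sketch below pairs Welch's t-test (`equal_var=False`, which does not assume equal group variances) with Cohen's d on synthetic data. The groups, the seed, and the helper `cohens_d` are hypothetical and only for illustration. By Cohen's conventional benchmarks, d near 0.2 is small, 0.5 medium, and 0.8 large.

<!-- CELL BREAK -->

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic spending for two hypothetical groups with unequal spread and size
group_a = rng.normal(loc=520, scale=150, size=400)
group_b = rng.normal(loc=500, scale=90, size=900)

# Welch's t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


d = cohens_d(group_a, group_b)
print(f"Welch t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"Cohen's d = {d:.3f}")
```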

<!-- CELL BREAK -->

---
## 🎨 Part 4: Advanced Visualization for Pattern Discovery

<!-- CELL BREAK -->

### 4.1 Visualization Strategy for EDA

**Why Visualization in EDA?**
Visual analysis can reveal patterns, outliers, and relationships that statistical summaries might miss. Different chart types are suited for different types of analysis.

#### 📊 Visualization Methods for Different Data Types

| **Data Type** | **Visualization** | **Purpose** | **When to Use** |
|---------------|------------------|-------------|-----------------|
| **Single Numerical** | Histogram, Box plot, Violin plot | Distribution analysis | Understand data spread and shape |
| **Two Numerical** | Scatter plot, Line plot | Relationship analysis | Find correlations and trends |
| **Categorical** | Bar chart, Pie chart | Frequency analysis | Compare categories |
| **Numerical + Categorical** | Box plot, Violin plot by group | Group comparisons | Compare distributions across groups |
| **Time Series** | Line plot, Area chart | Trend analysis | Identify patterns over time |
| **Multiple Variables** | Pair plot, Heatmap | Comprehensive analysis | Overview of many relationships |
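
The table can be read as a simple lookup from column types to chart types. The helper below is a rough sketch of that idea; `suggest_chart` and the toy DataFrame are hypothetical names invented here, and the mapping is a heuristic starting point rather than a rule.

<!-- CELL BREAK -->

```python
import pandas as pd


def suggest_chart(df, x, y=None):
    """Suggest a chart type from column dtypes (heuristic, following the table above)."""
    def kind(col):
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            return "time"
        if pd.api.types.is_numeric_dtype(df[col]):
            return "numerical"
        return "categorical"

    if y is None:
        return {
            "numerical": "histogram or box plot",
            "categorical": "bar chart",
            "time": "line plot of counts over time",
        }[kind(x)]

    kx, ky = kind(x), kind(y)
    if "time" in (kx, ky):
        return "line plot"
    if kx == ky == "numerical":
        return "scatter plot"
    if kx == ky == "categorical":
        return "heatmap of a cross-tabulation"
    return "box plot or violin plot by group"


# Toy frame with made-up values, mirroring the column types used in this session
toy = pd.DataFrame({
    "age": [25, 34, 41],
    "total_amount": [120.0, 80.5, 310.0],
    "segment": ["Basic", "Premium", "Standard"],
    "city": ["Riyadh", "Jeddah", "Riyadh"],
    "transaction_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

print(suggest_chart(toy, "age"))
print(suggest_chart(toy, "segment", "total_amount"))
print(suggest_chart(toy, "transaction_date", "total_amount"))
```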

<!-- CELL BREAK -->

```python
# Advanced Visualization for Pattern Discovery
print("🎨 ADVANCED VISUALIZATION FOR PATTERN DISCOVERY:")
print("="*60)

print("\n1️⃣ DISTRIBUTION COMPARISON VISUALIZATIONS:")
print("   Purpose: Compare distributions across different groups")

# Box plot comparison by segment
plt.figure(figsize=(15, 5))

# Subplot 1: Total Amount by Segment
plt.subplot(1, 3, 1)
segment_order = ['Basic', 'Standard', 'Premium']  # Order for logical progression
box_data = [ecommerce_data[ecommerce_data['segment'] == seg]['total_amount'].dropna() for seg in segment_order]
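box_plot = plt.boxplot(box_data, patch_artist=True)
# Set tick labels separately: boxplot's labels= argument is deprecated in newer Matplotlib releases
plt.xticks(range(1, len(segment_order) + 1), segment_order)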

# Color the boxes
colors = ['lightcoral', 'lightblue', 'lightgreen']
for patch, color in zip(box_plot['boxes'], colors):
    patch.set_facecolor(color)

plt.title('Total Amount by Customer Segment')
plt.xlabel('Customer Segment')
plt.ylabel('Total Amount ($)')
plt.grid(True, alpha=0.3)

# Subplot 2: Age by City
plt.subplot(1, 3, 2)
city_order = ['Riyadh', 'Jeddah', 'Dammam', 'Mecca', 'Medina']
# Use a distinct name so the age Series from the one-sample t-test (age_data) is not overwritten
age_by_city = [ecommerce_data[ecommerce_data['city'] == city]['age'].dropna() for city in city_order if city in ecommerce_data['city'].unique()]
city_labels = [city for city in city_order if city in ecommerce_data['city'].unique()]

violin_parts = plt.violinplot(age_by_city, positions=range(1, len(age_by_city)+1), showmeans=True)
plt.xticks(range(1, len(city_labels)+1), city_labels, rotation=45)
plt.title('Age Distribution by City')
plt.xlabel('City')
plt.ylabel('Age')
plt.grid(True, alpha=0.3)

# Subplot 3: Price by Category
plt.subplot(1, 3, 3)
categories = ecommerce_data['category'].unique()
price_data = [ecommerce_data[ecommerce_data['category'] == cat]['price'].dropna() for cat in categories]

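box_plot2 = plt.boxplot(price_data, patch_artist=True)
# Set tick labels separately: boxplot's labels= argument is deprecated in newer Matplotlib releases
plt.xticks(range(1, len(categories) + 1), categories)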
colors2 = ['yellow', 'orange', 'lightpink', 'lightcyan', 'lavender']
for patch, color in zip(box_plot2['boxes'], colors2[:len(categories)]):
    patch.set_facecolor(color)

plt.title('Price Distribution by Category')
plt.xlabel('Product Category')
plt.ylabel('Price ($)')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("   ✅ Distribution comparison plots created")

print("\n2️⃣ RELATIONSHIP DISCOVERY VISUALIZATIONS:")
print("   Purpose: Identify complex relationships between variables")

# Create advanced relationship plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Advanced Relationship Discovery', fontsize=16, fontweight='bold')

# Plot 1: Scatter with categorical coloring
scatter = axes[0, 0].scatter(ecommerce_data['age'], ecommerce_data['total_amount'], 
                           c=ecommerce_data['segment'].astype('category').cat.codes, 
                           alpha=0.6, cmap='viridis')
axes[0, 0].set_xlabel('Customer Age')
axes[0, 0].set_ylabel('Total Amount ($)')
axes[0, 0].set_title('Age vs Total Amount (colored by Segment)')
axes[0, 0].grid(True, alpha=0.3)

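# Add legend for segments. Category codes follow sorted category order, and scatter
# stretches codes 0..n-1 across the full colormap, so the legend must use the same mapping.
segment_categories = ecommerce_data['segment'].astype('category').cat.categories
n_segments = len(segment_categories)
for i, segment in enumerate(segment_categories):
    axes[0, 0].scatter([], [], color=plt.cm.viridis(i / max(n_segments - 1, 1)), label=segment)
axes[0, 0].legend()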

# Plot 2: Hexbin plot for density visualization
hexbin = axes[0, 1].hexbin(ecommerce_data['price'], ecommerce_data['total_amount'], 
                          gridsize=20, cmap='Blues', alpha=0.7)
axes[0, 1].set_xlabel('Product Price ($)')
axes[0, 1].set_ylabel('Total Amount ($)')
axes[0, 1].set_title('Price vs Total Amount (Density Plot)')
plt.colorbar(hexbin, ax=axes[0, 1])

# Plot 3: Bubble plot (3 dimensions)
bubble = axes[1, 0].scatter(ecommerce_data['age'], ecommerce_data['total_amount'], 
                           s=ecommerce_data['quantity']*20, alpha=0.6, 
                           c=ecommerce_data['price'], cmap='plasma')
axes[1, 0].set_xlabel('Customer Age')
axes[1, 0].set_ylabel('Total Amount ($)')
axes[1, 0].set_title('Age vs Total Amount (size=quantity, color=price)')
plt.colorbar(bubble, ax=axes[1, 0])

# Plot 4: Correlation with regression line by group
for segment in ecommerce_data['segment'].unique():
    segment_data = ecommerce_data[ecommerce_data['segment'] == segment]
    axes[1, 1].scatter(segment_data['age'], segment_data['total_amount'], 
                      alpha=0.6, label=segment)
    
    # Add regression line
    z = np.polyfit(segment_data['age'], segment_data['total_amount'], 1)
    p = np.poly1d(z)
    axes[1, 1].plot(segment_data['age'], p(segment_data['age']), '--', alpha=0.8)

axes[1, 1].set_xlabel('Customer Age')
axes[1, 1].set_ylabel('Total Amount ($)')
axes[1, 1].set_title('Age vs Total Amount by Segment (with trend lines)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("   ✅ Advanced relationship plots created")
```

<!-- CELL BREAK -->

### 4.2 Categorical Data Visualization

Categorical variables have no numeric scale, so we visualize them through counts, proportions, and cross-tabulations rather than histograms or scatter plots.

<!-- CELL BREAK -->

```python
print("\n3️⃣ CATEGORICAL DATA PATTERN ANALYSIS:")
print("   Purpose: Discover patterns in categorical variables")

# Create categorical analysis plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Categorical Data Pattern Analysis', fontsize=16, fontweight='bold')

# Plot 1: Count plot by category
category_counts = ecommerce_data['category'].value_counts()
bars1 = axes[0, 0].bar(category_counts.index, category_counts.values, color='skyblue', alpha=0.8)
axes[0, 0].set_title('Transaction Count by Category')
axes[0, 0].set_xlabel('Product Category')
axes[0, 0].set_ylabel('Number of Transactions')
axes[0, 0].tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar in bars1:
    height = bar.get_height()
    axes[0, 0].text(bar.get_x() + bar.get_width()/2., height + 20,
                    f'{int(height)}', ha='center', va='bottom')

# Plot 2: Segment distribution pie chart
segment_counts = ecommerce_data['segment'].value_counts()
colors_pie = ['lightcoral', 'lightblue', 'lightgreen']
wedges, texts, autotexts = axes[0, 1].pie(segment_counts.values, labels=segment_counts.index, 
                                          autopct='%1.1f%%', colors=colors_pie)
axes[0, 1].set_title('Customer Segment Distribution')

# Plot 3: City distribution horizontal bar
city_counts = ecommerce_data['city'].value_counts()
bars2 = axes[0, 2].barh(city_counts.index, city_counts.values, color='lightgreen', alpha=0.8)
axes[0, 2].set_title('Customer Distribution by City')
axes[0, 2].set_xlabel('Number of Transactions')
axes[0, 2].set_ylabel('City')

# Add value labels
for i, bar in enumerate(bars2):
    width = bar.get_width()
    axes[0, 2].text(width + 50, bar.get_y() + bar.get_height()/2,
                    f'{int(width)}', ha='left', va='center')

# Plot 4: Stacked bar chart - Category by Segment
category_segment_crosstab = pd.crosstab(ecommerce_data['category'], ecommerce_data['segment'])
category_segment_crosstab.plot(kind='bar', stacked=True, ax=axes[1, 0], color=['lightcoral', 'lightblue', 'lightgreen'])
axes[1, 0].set_title('Category Distribution by Segment (Stacked)')
axes[1, 0].set_xlabel('Product Category')
axes[1, 0].set_ylabel('Number of Transactions')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].legend(title='Segment')

# Plot 5: Grouped bar chart - Average spending by category and segment
avg_spending = ecommerce_data.groupby(['category', 'segment'])['total_amount'].mean().unstack()
avg_spending.plot(kind='bar', ax=axes[1, 1], color=['lightcoral', 'lightblue', 'lightgreen'])
axes[1, 1].set_title('Average Spending by Category and Segment')
axes[1, 1].set_xlabel('Product Category')
axes[1, 1].set_ylabel('Average Total Amount ($)')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].legend(title='Segment')

# Plot 6: Heatmap of cross-tabulation
city_category_crosstab = pd.crosstab(ecommerce_data['city'], ecommerce_data['category'])
im = axes[1, 2].imshow(city_category_crosstab.values, cmap='Blues', aspect='auto')
axes[1, 2].set_xticks(range(len(city_category_crosstab.columns)))
axes[1, 2].set_yticks(range(len(city_category_crosstab.index)))
axes[1, 2].set_xticklabels(city_category_crosstab.columns, rotation=45)
axes[1, 2].set_yticklabels(city_category_crosstab.index)
axes[1, 2].set_title('City vs Category Heatmap')
axes[1, 2].set_xlabel('Product Category')
axes[1, 2].set_ylabel('City')

# Add colorbar
plt.colorbar(im, ax=axes[1, 2])

# Add text annotations to heatmap
for i in range(len(city_category_crosstab.index)):
    for j in range(len(city_category_crosstab.columns)):
        text = axes[1, 2].text(j, i, city_category_crosstab.iloc[i, j],
                             ha="center", va="center", color="black", fontsize=8)

plt.tight_layout()
plt.show()

print("   ✅ Categorical pattern analysis completed")

print("\n4️⃣ KEY PATTERNS DISCOVERED:")

# Analyze the patterns
most_popular_category = ecommerce_data['category'].value_counts().index[0]
most_popular_count = ecommerce_data['category'].value_counts().iloc[0]

largest_segment = ecommerce_data['segment'].value_counts().index[0]
largest_segment_pct = (ecommerce_data['segment'].value_counts().iloc[0] / len(ecommerce_data)) * 100

most_active_city = ecommerce_data['city'].value_counts().index[0]
most_active_count = ecommerce_data['city'].value_counts().iloc[0]

print(f"   🛍️ Most popular category: {most_popular_category} ({most_popular_count:,} transactions)")
print(f"   👥 Largest customer segment: {largest_segment} ({largest_segment_pct:.1f}% of transactions)")
print(f"   🏙️ Most active city: {most_active_city} ({most_active_count:,} transactions)")

# Cross-category insights
highest_spending_combo = avg_spending.stack().idxmax()
highest_spending_amount = avg_spending.stack().max()
category, segment = highest_spending_combo
print(f"   💰 Highest spending combination: {segment} customers buying {category} (${highest_spending_amount:.2f} average)")
```
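
<!-- CELL BREAK -->

A heatmap of raw counts mostly reflects which city has the most transactions. Normalizing each row to percentages shows each city's category mix instead. The sketch below illustrates this with `pd.crosstab(..., normalize='index')` on a tiny made-up table (the rows are invented for illustration, not taken from the course dataset).

<!-- CELL BREAK -->

```python
import pandas as pd

# Tiny hypothetical transaction table
toy = pd.DataFrame({
    "city": ["Riyadh"] * 6 + ["Jeddah"] * 4,
    "category": ["Electronics", "Electronics", "Electronics", "Fashion", "Fashion", "Home",
                 "Electronics", "Fashion", "Fashion", "Fashion"],
})

# Raw counts: dominated by the city with more rows
counts = pd.crosstab(toy["city"], toy["category"])

# Row percentages: each city's rows sum to 100, so category mixes are comparable
shares = pd.crosstab(toy["city"], toy["category"], normalize="index") * 100

print(counts)
print(shares.round(1))
```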

<!-- CELL BREAK -->

---
## ⏰ Part 5: Time Series Analysis and Trends

<!-- CELL BREAK -->

### 5.1 Understanding Time Series Data

**What is Time Series Analysis?**
Time series analysis examines data points collected over time to identify trends, seasonal patterns, and cyclical behaviors. This is crucial for business forecasting and understanding customer behavior patterns.

#### 📊 Time Series Analysis Components

| **Component** | **Description** | **Business Example** | **Analysis Method** |
|---------------|-----------------|---------------------|-------------------|
| **Trend** | Long-term increase or decrease | Growing sales over years | Linear regression, moving averages |
| **Seasonality** | Regular patterns that repeat | Higher sales in holidays | Seasonal decomposition |
| **Cyclical** | Long-term fluctuations | Economic cycles | Cycle analysis |
| **Irregular** | Random fluctuations | Unexpected events | Outlier detection |
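
To make the components concrete, the sketch below builds a synthetic daily revenue series from a known trend, a weekly seasonal pattern, and random noise, then recovers rough estimates of each using only pandas and NumPy. All numbers (start date, growth rate, amplitude, noise level) are assumptions chosen for illustration. A centered 7-day moving average is used because it averages over exactly one weekly cycle.

<!-- CELL BREAK -->

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# 12 full weeks of hypothetical daily data
dates = pd.date_range("2024-01-01", periods=84, freq="D")
t = np.arange(len(dates))

trend = 1000 + 5 * t                         # steady growth of 5 per day
weekly = 200 * np.sin(2 * np.pi * t / 7)     # pattern that repeats every 7 days
noise = rng.normal(0, 50, size=len(dates))   # irregular component
revenue = pd.Series(trend + weekly + noise, index=dates)

# Trend estimate: centered 7-day mean smooths out the weekly cycle
trend_estimate = revenue.rolling(window=7, center=True).mean()

# Seasonal estimate: average detrended value for each day of the week
detrended = revenue - trend_estimate
seasonal_estimate = detrended.groupby(detrended.index.dayofweek).mean()

# Overall slope from a straight-line fit
slope = np.polyfit(t, revenue.to_numpy(), 1)[0]

print(f"Estimated daily growth: {slope:.2f}")
print(seasonal_estimate.round(1))
```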

<!-- CELL BREAK -->

```python
# Time Series Analysis
print("⏰ TIME SERIES ANALYSIS AND TRENDS:")
print("="*50)

print("\n1️⃣ PREPARING TIME SERIES DATA:")
print("   Purpose: Aggregate transaction data by time periods")

# Create time-based aggregations
ecommerce_data['transaction_date'] = pd.to_datetime(ecommerce_data['transaction_date'])

# Daily aggregations
daily_sales = ecommerce_data.groupby(ecommerce_data['transaction_date'].dt.date).agg({
    'total_amount': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)

# Flatten column names
daily_sales.columns = ['daily_revenue', 'avg_transaction', 'transaction_count', 'total_quantity']
daily_sales.reset_index(inplace=True)
daily_sales['transaction_date'] = pd.to_datetime(daily_sales['transaction_date'])

print(f"   ✅ Created daily sales data: {len(daily_sales)} days")

# Monthly aggregations
# Name the year and month groupers explicitly: both derive from 'transaction_date',
# and duplicate index names would make reset_index() fail
monthly_sales = ecommerce_data.groupby([
    ecommerce_data['transaction_date'].dt.year.rename('year'), 
    ecommerce_data['transaction_date'].dt.month.rename('month')
]).agg({
    'total_amount': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)

monthly_sales.columns = ['monthly_revenue', 'avg_transaction', 'transaction_count', 'total_quantity']
monthly_sales.reset_index(inplace=True)
# pd.to_datetime needs year, month, and day columns, so anchor each month on its first day
monthly_sales['date'] = pd.to_datetime(monthly_sales[['year', 'month']].assign(day=1))

print(f"   ✅ Created monthly sales data: {len(monthly_sales)} months")

# Weekly patterns
weekly_pattern = ecommerce_data.groupby(ecommerce_data['transaction_date'].dt.day_name()).agg({
    'total_amount': ['sum', 'mean', 'count']
}).round(2)
weekly_pattern.columns = ['weekly_revenue', 'avg_transaction', 'transaction_count']

# Reorder days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekly_pattern = weekly_pattern.reindex(day_order)

print(f"   ✅ Created weekly pattern analysis")

print("\n2️⃣ TREND ANALYSIS:")
print("   Purpose: Identify long-term patterns in the data")

# Create trend visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Time Series Analysis - Trends and Patterns', fontsize=16, fontweight='bold')

# Plot 1: Daily revenue trend
axes[0, 0].plot(daily_sales['transaction_date'], daily_sales['daily_revenue'], 
                color='blue', alpha=0.7, linewidth=1)

# Add moving average
daily_sales['ma_7'] = daily_sales['daily_revenue'].rolling(window=7).mean()
daily_sales['ma_30'] = daily_sales['daily_revenue'].rolling(window=30).mean()

axes[0, 0].plot(daily_sales['transaction_date'], daily_sales['ma_7'], 
                color='red', linewidth=2, label='7-day Moving Average')
axes[0, 0].plot(daily_sales['transaction_date'], daily_sales['ma_30'], 
                color='green', linewidth=2, label='30-day Moving Average')

axes[0, 0].set_title('Daily Revenue Trend')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Daily Revenue ($)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Monthly revenue trend
axes[0, 1].plot(monthly_sales['date'], monthly_sales['monthly_revenue'], 
                marker='o', color='green', linewidth=2, markersize=6)
axes[0, 1].set_title('Monthly Revenue Trend')
axes[0, 1].set_xlabel('Month')
axes[0, 1].set_ylabel('Monthly Revenue ($)')
axes[0, 1].grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(range(len(monthly_sales)), monthly_sales['monthly_revenue'], 1)
p = np.poly1d(z)
axes[0, 1].plot(monthly_sales['date'], p(range(len(monthly_sales))), 
                'r--', alpha=0.8, label='Trend Line')
axes[0, 1].legend()

# Plot 3: Weekly pattern
bars = axes[1, 0].bar(weekly_pattern.index, weekly_pattern['weekly_revenue'], 
                      color='skyblue', alpha=0.8)
axes[1, 0].set_title('Weekly Revenue Pattern')
axes[1, 0].set_xlabel('Day of Week')
axes[1, 0].set_ylabel('Total Revenue ($)')
axes[1, 0].tick_params(axis='x', rotation=45)

# Add value labels
for bar in bars:
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                    f'${height:,.0f}', ha='center', va='bottom', fontsize=9)

# Plot 4: Hourly pattern
hourly_pattern = ecommerce_data.groupby('hour')['total_amount'].sum()
axes[1, 1].plot(hourly_pattern.index, hourly_pattern.values, 
                marker='o', color='orange', linewidth=2, markersize=6)
axes[1, 1].set_title('Hourly Sales Pattern')
axes[1, 1].set_xlabel('Hour of Day')
axes[1, 1].set_ylabel('Total Revenue ($)')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_xticks(range(0, 24, 2))

plt.tight_layout()
plt.show()

print("   ✅ Trend analysis visualizations created")

print("\n3️⃣ SEASONAL PATTERN ANALYSIS:")
print("   Purpose: Identify recurring patterns and seasonality")

# Monthly pattern analysis
monthly_pattern = ecommerce_data.groupby('month')['total_amount'].agg(['sum', 'mean', 'count'])
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

print("\n   📊 MONTHLY SEASONAL PATTERNS:")
print("   Month | Total Revenue | Avg Transaction | Count")
print("   ------|---------------|-----------------|------")
for month in monthly_pattern.index:
    total_rev = monthly_pattern.loc[month, 'sum']
    avg_trans = monthly_pattern.loc[month, 'mean']
    count = monthly_pattern.loc[month, 'count']
    month_name = month_names[month-1]
    print(f"   {month_name:5} | ${total_rev:12,.0f} | ${avg_trans:14.2f} | {count:5.0f}")

# Weekly pattern analysis
print("\n   📊 WEEKLY SEASONAL PATTERNS:")
print("   Day       | Total Revenue | Avg Transaction | Count")
print("   ----------|---------------|-----------------|------")
for day in weekly_pattern.index:
    total_rev = weekly_pattern.loc[day, 'weekly_revenue']
    avg_trans = weekly_pattern.loc[day, 'avg_transaction']
    count = weekly_pattern.loc[day, 'transaction_count']
    print(f"   {day:9} | ${total_rev:12,.0f} | ${avg_trans:14.2f} | {count:5.0f}")

print("\n4️⃣ TREND INSIGHTS AND BUSINESS IMPLICATIONS:")

# Calculate trend metrics
overall_trend = np.polyfit(range(len(daily_sales)), daily_sales['daily_revenue'], 1)[0]
recent_avg = daily_sales['daily_revenue'].tail(30).mean()
early_avg = daily_sales['daily_revenue'].head(30).mean()

print(f"\n   📈 TREND ANALYSIS:")
if overall_trend > 0:
    print(f"   • Overall trend: Positive (${overall_trend:.2f} per day increase)")
else:
    print(f"   • Overall trend: Negative (${abs(overall_trend):.2f} per day decrease)")

print(f"   • Recent 30-day average: ${recent_avg:.2f}")
print(f"   • Early 30-day average: ${early_avg:.2f}")
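# Compare the most recent 30 days with the first 30 days; guard against a zero baseline
if early_avg != 0:
    print(f"   • Growth rate (recent vs. early 30 days): {((recent_avg / early_avg - 1) * 100):+.1f}%")
else:
    print("   • Growth rate: n/a (early 30-day average is zero)")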

# Best and worst performing days/months
best_day = weekly_pattern['weekly_revenue'].idxmax()
worst_day = weekly_pattern['weekly_revenue'].idxmin()
best_month = monthly_pattern['sum'].idxmax()
worst_month = monthly_pattern['sum'].idxmin()

print(f"\n   🎯 PERFORMANCE INSIGHTS:")
print(f"   • Best performing day: {best_day} (${weekly_pattern.loc[best_day, 'weekly_revenue']:,.0f})")
print(f"   • Worst performing day: {worst_day} (${weekly_pattern.loc[worst_day, 'weekly_revenue']:,.0f})")
print(f"   • Best performing month: {month_names[best_month-1]} (${monthly_pattern.loc[best_month, 'sum']:,.0f})")
print(f"   • Worst performing month: {month_names[worst_month-1]} (${monthly_pattern.loc[worst_month, 'sum']:,.0f})")

# Hour pattern insights
peak_hour = hourly_pattern.idxmax()
low_hour = hourly_pattern.idxmin()
print(f"   • Peak sales hour: {peak_hour}:00 (${hourly_pattern.loc[peak_hour]:,.0f})")
print(f"   • Lowest sales hour: {low_hour}:00 (${hourly_pattern.loc[low_hour]:,.0f})")
```
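
The day-of-week totals above sum revenue over every occurrence of each weekday, so a weekday that appears more often in the date range can look stronger than it really is. The following is a minimal sketch, using a hypothetical toy DataFrame (not the course dataset) that only assumes the same `transaction_date` and `total_amount` column names, of dividing each weekday's total by the number of distinct calendar dates it covers:

```python
import pandas as pd

# Hypothetical toy data (illustrative only): two Mondays and one Tuesday
toy = pd.DataFrame({
    "transaction_date": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-08", "2024-01-02"]
    ),
    "total_amount": [100.0, 50.0, 150.0, 200.0],
})

# Weekday name and calendar date (time of day stripped) for each transaction
day_name = toy["transaction_date"].dt.day_name().rename("day")
calendar_date = toy["transaction_date"].dt.normalize()

# Total revenue per weekday (the quantity shown in the weekly bar chart)
total_by_day = toy.groupby(day_name)["total_amount"].sum()

# Number of distinct calendar dates that each weekday covers
dates_per_day = calendar_date.groupby(day_name).nunique()

# Average revenue per calendar day, comparable across weekdays
avg_daily_revenue = (total_by_day / dates_per_day).rename("avg_daily_revenue")
print(avg_daily_revenue)
```

In this toy example Monday has the larger total but the smaller per-day average, which is the kind of reversal the normalization is meant to expose.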

<!-- CELL BREAK -->

---
## 💼 Part 6: Comparative Analysis and Business Insights

<!-- CELL BREAK -->

### 6.1 Comparative Analysis Techniques

**What is Comparative Analysis?**
Comparative analysis contrasts groups, time periods, or segments against one another to reveal performance differences and opportunities for improvement.

#### 📊 Comparative Analysis Methods

| **Comparison Type** | **Purpose** | **Methods** | **Business Applications** |
|-------------------|-------------|-------------|---------------------------|
| **Segment Comparison** | Compare customer groups | Mean comparison, statistical tests | Marketing strategy, pricing |
| **Time Period Comparison** | Compare different periods | Year-over-year, period-over-period | Performance tracking, trends |
| **Geographic Comparison** | Compare locations | Regional analysis | Expansion planning, localization |
| **Product Comparison** | Compare product performance | Category analysis | Inventory management, promotion |
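
Two of the methods in the table, mean comparison and period-over-period change, can be sketched with pandas alone. The DataFrame below is hypothetical toy data, and the `month` column with its string labels is an assumption made for illustration rather than the structure of the course dataset:

```python
import pandas as pd

# Hypothetical toy orders (illustrative values, not the course dataset)
orders = pd.DataFrame({
    "segment": ["Premium", "Premium", "Standard", "Standard", "Standard", "Premium"],
    "month": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02", "2024-02"],
    "total_amount": [300.0, 500.0, 100.0, 120.0, 80.0, 400.0],
})

# Segment comparison: mean order value per segment and its gap to the overall mean
segment_means = orders.groupby("segment")["total_amount"].mean()
gap_vs_overall = segment_means - orders["total_amount"].mean()

# Time period comparison: percentage change in revenue from one month to the next
monthly_revenue = orders.groupby("month")["total_amount"].sum().sort_index()
mom_change_pct = monthly_revenue.pct_change() * 100

print(segment_means)
print(gap_vs_overall)
print(mom_change_pct.round(1))
```

A gap between segment means does not by itself show that the difference is statistically meaningful; a formal test, as covered in the statistical testing part of this session, would be the natural follow-up.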

<!-- CELL BREAK -->

```python
# Comparative Analysis and Business Insights
print("💼 COMPARATIVE ANALYSIS AND BUSINESS INSIGHTS:")
print("="*60)

print("\n1️⃣ CUSTOMER SEGMENT PERFORMANCE COMPARISON:")
print("   Purpose: Identify highest value customer segments")

# Comprehensive segment analysis
segment_analysis = ecommerce_data.groupby('segment').agg({
    'total_amount': ['sum', 'mean', 'median', 'count'],
    'quantity': ['sum', 'mean'],
    'age': 'mean',
    'customer_id': 'nunique'
}).round(2)

# Flatten column names
segment_analysis.columns = ['total_revenue', 'avg_order_value', 'median_order_value', 'order_count', 
                          'total_quantity', 'avg_quantity', 'avg_age', 'unique_customers']

# Calculate additional metrics
segment_analysis['revenue_per_customer'] = segment_analysis['total_revenue'] / segment_analysis['unique_customers']
segment_analysis['orders_per_customer'] = segment_analysis['order_count'] / segment_analysis['unique_customers']

print("\n   📊 SEGMENT PERFORMANCE METRICS:")
print(segment_analysis)

# Find top performing segments
top_revenue_segment = segment_analysis['total_revenue'].idxmax()
top_aov_segment = segment_analysis['avg_order_value'].idxmax()
top_loyalty_segment = segment_analysis['orders_per_customer'].idxmax()

print(f"\n   🏆 TOP PERFORMING SEGMENTS:")
print(f"   • Highest total revenue: {top_revenue_segment} (${segment_analysis.loc[top_revenue_segment, 'total_revenue']:,.0f})")
print(f"   • Highest average order value: {top_aov_segment} (${segment_analysis.loc[top_aov_segment, 'avg_order_value']:.2f})")
print(f"   • Most loyal customers: {top_loyalty_segment} ({segment_analysis.loc[top_loyalty_segment, 'orders_per_customer']:.1f} orders/customer)")

print("\n2️⃣ GEOGRAPHIC PERFORMANCE COMPARISON:")
print("   Purpose: Identify best performing cities and expansion opportunities")

# Geographic analysis
city_analysis = ecommerce_data.groupby('city').agg({
    'total_amount': ['sum', 'mean', 'count'],
    'customer_id': 'nunique',
    'age': 'mean'
}).round(2)

city_analysis.columns = ['total_revenue', 'avg_order_value', 'order_count', 'unique_customers', 'avg_age']
city_analysis['revenue_per_customer'] = city_analysis['total_revenue'] / city_analysis['unique_customers']
city_analysis['market_share'] = (city_analysis['total_revenue'] / city_analysis['total_revenue'].sum()) * 100

print("\n   📊 CITY PERFORMANCE METRICS:")
print(city_analysis)

# Ranking cities
print(f"\n   🏆 CITY RANKINGS:")
print(f"   • By total revenue: {city_analysis['total_revenue'].sort_values(ascending=False).index.tolist()}")
print(f"   • By average order value: {city_analysis['avg_order_value'].sort_values(ascending=False).index.tolist()}")
print(f"   • By revenue per customer: {city_analysis['revenue_per_customer'].sort_values(ascending=False).index.tolist()}")

print("\n3️⃣ PRODUCT CATEGORY PERFORMANCE COMPARISON:")
print("   Purpose: Identify top performing product categories")

# Category analysis
category_analysis = ecommerce_data.groupby('category').agg({
    'total_amount': ['sum', 'mean', 'count'],
    'price': ['mean', 'median'],
    'quantity': ['sum', 'mean']
}).round(2)

category_analysis.columns = ['total_revenue', 'avg_order_value', 'order_count', 
                           'avg_price', 'median_price', 'total_quantity', 'avg_quantity']
category_analysis['market_share'] = (category_analysis['total_revenue'] / category_analysis['total_revenue'].sum()) * 100

print("\n   📊 CATEGORY PERFORMANCE METRICS:")
print(category_analysis)

# Category insights
best_revenue_category = category_analysis['total_revenue'].idxmax()
highest_price_category = category_analysis['avg_price'].idxmax()
most_ordered_category = category_analysis['order_count'].idxmax()

print(f"\n   🏆 CATEGORY INSIGHTS:")
print(f"   • Highest revenue category: {best_revenue_category} (${category_analysis.loc[best_revenue_category, 'total_revenue']:,.0f})")
print(f"   • Highest priced category: {highest_price_category} (${category_analysis.loc[highest_price_category, 'avg_price']:.2f} avg price)")
print(f"   • Most frequently ordered: {most_ordered_category} ({category_analysis.loc[most_ordered_category, 'order_count']:,.0f} orders)")

print("\n4️⃣ TIME-BASED PERFORMANCE COMPARISON:")
print("   Purpose: Compare performance across different time periods")

# Create quarterly comparison
ecommerce_data['quarter'] = ecommerce_data['transaction_date'].dt.quarter
quarterly_performance = ecommerce_data.groupby('quarter')['total_amount'].agg(['sum', 'mean', 'count'])

print("\n   📊 QUARTERLY PERFORMANCE:")
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
# Note: dt.quarter is always 1-4, so the same quarter from different years is pooled together
for q in quarterly_performance.index:
    quarter_name = quarters[q - 1]
    revenue = quarterly_performance.loc[q, 'sum']
    avg_order = quarterly_performance.loc[q, 'mean']
    count = quarterly_performance.loc[q, 'count']
    print(f"   • {quarter_name}: ${revenue:,.0f} total, ${avg_order:.2f} avg, {count:.0f} orders")

print("\n5️⃣ COMPREHENSIVE BUSINESS INSIGHTS:")

# Create executive summary insights
insights = []

# Revenue insights
total_revenue = ecommerce_data['total_amount'].sum()
total_customers = ecommerce_data['customer_id'].nunique()
total_orders = len(ecommerce_data)
avg_customer_value = total_revenue / total_customers

insights.append(f"💰 Total business performance: ${total_revenue:,.0f} revenue from {total_customers:,} customers across {total_orders:,} orders")
insights.append(f"📈 Average customer value: ${avg_customer_value:.2f}")

# Segment insights (assumes the data contains a segment labelled 'Premium'; otherwise .loc raises KeyError)
premium_share = (segment_analysis.loc['Premium', 'total_revenue'] / total_revenue) * 100
insights.append(f"💎 Premium customers represent {premium_share:.1f}%