# Tier 1: Distribution Analysis

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 25bbbb63-5a6a-47cd-92e0-1453484beff0

---

## Citation
Brandon Deloatch, "Tier 1: Distribution Analysis," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+, seaborn (imported by the notebook; version not pinned)
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.
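
The minimum versions listed above can be checked against the running environment. The following is a minimal sketch, assuming the PyPI distribution names match the names in the list; the helpers `installed_versions` and `version_tuple` are hypothetical and not part of the notebook.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUMS = {
    "pandas": "2.0",
    "numpy": "1.24",
    "scikit-learn": "1.3",
    "plotly": "5.0",
    "matplotlib": "3.7",
    "scipy": "1.10",
    "statsmodels": "0.14",
}

def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

def version_tuple(text):
    """Convert '1.24.3' to (1, 24); only the leading digits of each part are used."""
    parts = []
    for piece in text.split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)

for name, version in installed_versions(MINIMUMS).items():
    if version is None:
        status = "missing"
    elif version_tuple(version) >= version_tuple(MINIMUMS[name]):
        status = "ok"
    else:
        status = "too old"
    print(f"{name:<14} required >= {MINIMUMS[name]:<5} installed: {version} ({status})")
```

This only compares major and minor versions; a pinned requirements.txt or environment.yml remains the more reliable route to exact matching.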

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Drawn from NumPy random distributions with a fixed seed (42) |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 25bbbb63-5a6a-47cd-92e0-1453484beff0
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
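
One way such metadata could be captured is sketched below, using only the standard library. The function name `capture_execution_metadata` and the choice of fields are assumptions for illustration, not something the notebook defines.

```python
import json
import platform
import sys
from datetime import datetime, timezone

def capture_execution_metadata(notebook_id):
    """Collect basic run metadata as a JSON-serializable dict."""
    return {
        "notebook_id": notebook_id,
        "executed_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }

run_metadata = capture_execution_metadata("25bbbb63-5a6a-47cd-92e0-1453484beff0")
print(json.dumps(run_metadata, indent=2))
```

Writing the resulting dict next to the notebook outputs keeps a record of when and where each run happened.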

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import scipy.stats as stats
from scipy.stats import norm, lognorm, gamma, beta, uniform, expon
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

print("Tier 1: Distribution Analysis - Libraries Loaded Successfully!")
print("=" * 65)
print("Available Distribution Analysis Techniques:")
print("• Histogram Analysis - Frequency distribution visualization")
print("• Density Plots - Smooth probability density estimation")
print("• Box Plots - Quartile and outlier analysis")
print("• Violin Plots - Distribution shape and density")
print("• Q-Q Plots - Normality testing and comparison")
print("• Distribution Fitting - Statistical model selection")
print("• Comparative Analysis - Multi-group distributions")

In [None]:
# Generate Comprehensive Business Dataset for Distribution Analysis
np.random.seed(42)

def generate_business_distribution_dataset(n_samples=2000):
    """Generate a synthetic business dataset covering several distribution shapes"""

    # Sales revenue (log-normal distribution, common for revenue data)
    sales_revenue = np.random.lognormal(mean=10.5, sigma=0.8, size=n_samples)

    # Customer satisfaction (Beta distribution rescaled to 0-10)
    satisfaction_raw = np.random.beta(a=2, b=0.5, size=n_samples)
    customer_satisfaction = satisfaction_raw * 10

    # Employee performance (normal distribution, clipped to 0-100)
    employee_performance = np.random.normal(loc=75, scale=12, size=n_samples)
    employee_performance = np.clip(employee_performance, 0, 100)

    # Website response time (exponential distribution)
    response_time = np.random.exponential(scale=1.5, size=n_samples)

    # Product prices (gamma distribution)
    product_prices = np.random.gamma(shape=2, scale=50, size=n_samples)

    # Customer age (normal distribution, clipped to 18-80)
    customer_age = np.random.normal(loc=45, scale=15, size=n_samples)
    customer_age = np.clip(customer_age, 18, 80)

    # Transaction amounts (Pareto distribution - 80/20 rule)
    transaction_base = np.random.pareto(a=1.16, size=n_samples) + 1
    transaction_amounts = transaction_base * 100

    # Marketing campaign results (uniform distribution)
    campaign_effectiveness = np.random.uniform(low=0, high=100, size=n_samples)

    # Bimodal distribution for customer segments
    n_first = n_samples // 2
    segment_1 = np.random.normal(loc=30, scale=8, size=n_first)
    # Second segment takes the remainder so odd n_samples still yields n_samples rows
    segment_2 = np.random.normal(loc=70, scale=10, size=n_samples - n_first)
    customer_lifetime_value = np.concatenate([segment_1, segment_2])
    np.random.shuffle(customer_lifetime_value)

    # Categorical variables
    regions = np.random.choice(['North', 'South', 'East', 'West'], n_samples, p=[0.3, 0.25, 0.25, 0.2])
    business_types = np.random.choice(['B2B', 'B2C', 'Enterprise'], n_samples, p=[0.4, 0.45, 0.15])
    customer_segments = np.random.choice(['Premium', 'Standard', 'Basic'], n_samples, p=[0.2, 0.5, 0.3])

    # Create seasonal patterns in some variables
    time_component = np.arange(n_samples) / n_samples * 4 * np.pi  # 2 full cycles
    seasonal_sales = sales_revenue * (1 + 0.2 * np.sin(time_component))

    return pd.DataFrame({
        'sales_revenue': sales_revenue,
        'seasonal_sales': seasonal_sales,
        'customer_satisfaction': customer_satisfaction,
        'employee_performance': employee_performance,
        'response_time': response_time,
        'product_prices': product_prices,
        'customer_age': customer_age,
        'transaction_amounts': transaction_amounts,
        'campaign_effectiveness': campaign_effectiveness,
        'customer_lifetime_value': customer_lifetime_value,
        'region': regions,
        'business_type': business_types,
        'customer_segment': customer_segments
    })

# Generate dataset
print("Generating business distribution dataset...")
df = generate_business_distribution_dataset(2000)
print(f"Dataset Shape: {df.shape}")
print("\nDataset Overview:")
print(df.head())
print("\nBasic Statistics:")
print(df.describe().round(2))
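
The generator draws `sales_revenue` from a log-normal distribution with log-space parameters 10.5 and 0.8. For a log-normal, the mean is exp(mu + sigma^2 / 2) and the median is exp(mu), which gives a quick sanity check on the generated column. The sketch below assumes a separate local generator (`np.random.default_rng`) and a larger sample than the notebook uses, so its numbers will not match the notebook's own draw exactly.

```python
import numpy as np

mu, sigma = 10.5, 0.8            # log-space parameters used for sales_revenue
rng = np.random.default_rng(42)  # local generator; the notebook itself uses np.random.seed
sample = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

# Closed-form moments of the log-normal distribution
theoretical_mean = np.exp(mu + sigma**2 / 2)
theoretical_median = np.exp(mu)

print(f"mean:   sample={sample.mean():,.0f}  theory={theoretical_mean:,.0f}")
print(f"median: sample={np.median(sample):,.0f}  theory={theoretical_median:,.0f}")
```

The gap between mean and median is itself a signature of right skew, which the later sections measure directly.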

In [None]:
# 1. BASIC DISTRIBUTION VISUALIZATION
print("1. BASIC DISTRIBUTION VISUALIZATION")
print("=" * 38)

# Create comprehensive distribution overview
numerical_vars = ['sales_revenue', 'customer_satisfaction', 'employee_performance',
 'response_time', 'product_prices', 'customer_age']

fig = make_subplots(
 rows=3, cols=2,
 subplot_titles=[var.replace('_', ' ').title() for var in numerical_vars],
 specs=[[{"secondary_y": False}, {"secondary_y": False}],
 [{"secondary_y": False}, {"secondary_y": False}],
 [{"secondary_y": False}, {"secondary_y": False}]]
)

positions = [(1,1), (1,2), (2,1), (2,2), (3,1), (3,2)]
colors = ['blue', 'red', 'green', 'orange', 'purple', 'brown']

for i, var in enumerate(numerical_vars):
    row, col = positions[i]

    # Histogram
    fig.add_trace(
        go.Histogram(
            x=df[var],
            nbinsx=30,
            name=var,
            marker_color=colors[i],
            opacity=0.7,
            showlegend=False,
            hovertemplate=f"<b>{var}</b><br>Value: %{{x:.2f}}<br>Count: %{{y}}<extra></extra>"
        ),
        row=row, col=col
    )

fig.update_layout(
 title="Distribution Overview: Key Business Variables",
 height=800,
 showlegend=False
)
fig.show()

# Statistical summary with distribution characteristics
print("Distribution Characteristics Summary:")
print("=" * 42)

for var in numerical_vars:
    data = df[var]

    # Basic statistics
    mean_val = data.mean()
    median_val = data.median()
    std_val = data.std()
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)  # excess (Fisher) kurtosis: 0 for a normal distribution

    # Distribution shape analysis
    if abs(skewness) < 0.5:
        skew_desc = "approximately symmetric"
    elif skewness > 0:
        skew_desc = "right-skewed (positive)"
    else:
        skew_desc = "left-skewed (negative)"

    if kurtosis > 0:
        kurt_desc = "heavy-tailed (leptokurtic)"
    elif kurtosis < 0:
        kurt_desc = "light-tailed (platykurtic)"
    else:
        kurt_desc = "normal tails (mesokurtic)"

    print(f"\n{var.replace('_', ' ').title()}:")
    print(f" • Mean: {mean_val:.2f} | Median: {median_val:.2f}")
    print(f" • Standard Deviation: {std_val:.2f}")
    print(f" • Skewness: {skewness:.3f} ({skew_desc})")
    print(f" • Kurtosis: {kurtosis:.3f} ({kurt_desc})")
    print(f" • Range: [{data.min():.2f}, {data.max():.2f}]")
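
The kurtosis thresholds above compare against 0 because `scipy.stats.kurtosis` returns excess (Fisher) kurtosis by default. A minimal sketch on synthetic samples illustrates the convention; the sample sizes and seeds are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
expon_sample = rng.exponential(size=100_000)

# fisher=True (the default) reports excess kurtosis: about 0 for normal data
excess_kurtosis_normal = stats.kurtosis(normal_sample)
# fisher=False reports Pearson kurtosis: about 3 for normal data
pearson_kurtosis_normal = stats.kurtosis(normal_sample, fisher=False)

# The exponential distribution has theoretical skewness 2 and excess kurtosis 6
skew_expon = stats.skew(expon_sample)
excess_kurtosis_expon = stats.kurtosis(expon_sample)

print(f"normal:      excess={excess_kurtosis_normal:.3f}  pearson={pearson_kurtosis_normal:.3f}")
print(f"exponential: skew={skew_expon:.3f}  excess={excess_kurtosis_expon:.3f}")
```

Because sample kurtosis is almost never exactly 0, the "mesokurtic" branch is effectively unreachable; a small tolerance band around 0 would be one way to make it meaningful.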

In [None]:
# 2. ADVANCED DISTRIBUTION PLOTS
print("\n2. ADVANCED DISTRIBUTION PLOTS")
print("=" * 36)

# 2.1 Density Plots with Multiple Variables
print("2.1 Kernel Density Estimation:")

# Create density plots for key variables
fig_density = go.Figure()

key_vars = ['sales_revenue', 'customer_satisfaction', 'employee_performance']
colors_density = ['blue', 'red', 'green']

from scipy.stats import gaussian_kde

for i, var in enumerate(key_vars):
    # Standardize to z-scores so variables on very different scales share one axis
    data = (df[var] - df[var].mean()) / df[var].std()

    # Calculate KDE
    kde = gaussian_kde(data)
    x_range = np.linspace(data.min(), data.max(), 200)
    density = kde(x_range)

    fig_density.add_trace(
        go.Scatter(
            x=x_range,
            y=density,
            mode='lines',
            fill='tozeroy',  # fill each curve down to zero rather than to the previous trace
            name=var.replace('_', ' ').title(),
            line=dict(color=colors_density[i], width=3),
            opacity=0.6,
            hovertemplate=f"<b>{var}</b><br>Value: %{{x:.2f}}<br>Density: %{{y:.4f}}<extra></extra>"
        )
    )

fig_density.update_layout(
 title="Kernel Density Estimation: Distribution Comparison",
 xaxis_title="Value (Standardized Scale)",
 yaxis_title="Density",
 height=500
)
fig_density.show()

# 2.2 Box Plots with Outlier Analysis
print("\n2.2 Box Plot Analysis:")

# Standardize data for comparison
scaler = StandardScaler()
standardized_data = pd.DataFrame(
 scaler.fit_transform(df[key_vars]),
 columns=key_vars
)

fig_box = go.Figure()

for i, var in enumerate(key_vars):
    fig_box.add_trace(
        go.Box(
            y=standardized_data[var],
            name=var.replace('_', ' ').title(),
            boxpoints='outliers',
            marker_color=colors_density[i],
            # No `text` is supplied to the trace, so the hover shows only the value
            hovertemplate=f"<b>{var}</b><br>Standardized value: %{{y:.2f}}<extra></extra>"
        )
    )

fig_box.update_layout(
 title="Box Plot Analysis: Distribution Comparison (Standardized)",
 yaxis_title="Standardized Value",
 height=500
)
fig_box.show()

# 2.3 Violin Plots
print("\n2.3 Violin Plot Analysis:")

fig_violin = go.Figure()

for i, var in enumerate(key_vars):
    fig_violin.add_trace(
        go.Violin(
            y=standardized_data[var],
            name=var.replace('_', ' ').title(),
            box_visible=True,
            meanline_visible=True,
            fillcolor=colors_density[i],
            opacity=0.6,
            line_color='black'
        )
    )

fig_violin.update_layout(
 title="Violin Plots: Distribution Shape and Density",
 yaxis_title="Standardized Value",
 height=500
)
fig_violin.show()

# Outlier Analysis (1.5 * IQR rule on the original, unstandardized values)
print("\nOutlier Analysis Summary:")
for var in key_vars:
    data = df[var]
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data < lower_bound) | (data > upper_bound)]
    outlier_percentage = (len(outliers) / len(data)) * 100

    print(f"• {var.replace('_', ' ').title()}: {len(outliers)} outliers ({outlier_percentage:.1f}%)")
    if len(outliers) > 0:
        print(f" Range: [{outliers.min():.2f}, {outliers.max():.2f}]")
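
The 1.5 × IQR rule used in the outlier summary can be traced by hand on a small array. The helper name `iqr_outlier_mask` and the toy values below are hypothetical, chosen only to make the fences easy to verify.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

toy = np.array([10, 12, 11, 13, 12, 11, 10, 95])
mask = iqr_outlier_mask(toy)

# Q1 = 10.75, Q3 = 12.25, IQR = 1.5, so the fences are 8.5 and 14.5
print(toy[mask])  # [95]
```

For heavily skewed variables such as log-normal revenue, this rule flags a large share of the upper tail, so the counts describe tail weight more than data errors.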

In [None]:
# 3. NORMALITY TESTING AND Q-Q PLOTS
print("\n3. NORMALITY TESTING AND Q-Q PLOTS")
print("=" * 40)

def perform_normality_tests(data, variable_name):
    """Run Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, and Jarque-Bera tests"""

    # Shapiro-Wilk test (sample size <= 5000)
    if len(data) <= 5000:
        shapiro_stat, shapiro_p = stats.shapiro(data)
    else:
        shapiro_stat, shapiro_p = None, None

    # Kolmogorov-Smirnov test
    # Note: mean and std are estimated from the same data, so this p-value is
    # optimistic (a Lilliefors correction would be needed for an exact test)
    ks_stat, ks_p = stats.kstest(data, 'norm', args=(data.mean(), data.std()))

    # Anderson-Darling test (statistic reported; not used in the is_normal decision)
    ad_stat, ad_critical, ad_significance = stats.anderson(data, dist='norm')

    # Jarque-Bera test
    jb_stat, jb_p = stats.jarque_bera(data)

    return {
        'variable': variable_name,
        'shapiro_stat': shapiro_stat,
        'shapiro_p': shapiro_p,
        'ks_stat': ks_stat,
        'ks_p': ks_p,
        'ad_stat': ad_stat,
        'jb_stat': jb_stat,
        'jb_p': jb_p,
        # Treated as normal only if every available test fails to reject at 5%
        'is_normal': all([
            shapiro_p > 0.05 if shapiro_p is not None else True,
            ks_p > 0.05,
            jb_p > 0.05
        ])
    }

# Test normality for all numerical variables
normality_results = []
for var in numerical_vars:
    result = perform_normality_tests(df[var], var)
    normality_results.append(result)

# Create Q-Q plots for normality assessment
fig_qq = make_subplots(
 rows=3, cols=2,
 subplot_titles=[f"Q-Q Plot: {var.replace('_', ' ').title()}" for var in numerical_vars]
)

for i, var in enumerate(numerical_vars):
    row = (i // 2) + 1
    col = (i % 2) + 1

    # Generate Q-Q plot data
    # (named standardized_values to avoid overwriting the standardized_data DataFrame)
    data = df[var]
    standardized_values = (data - data.mean()) / data.std()

    # Theoretical quantiles (normal distribution)
    n = len(data)
    theoretical_quantiles = stats.norm.ppf(np.arange(1, n + 1) / (n + 1))

    # Sort sample quantiles
    sample_quantiles = np.sort(standardized_values)

    # Add Q-Q plot
    fig_qq.add_trace(
        go.Scatter(
            x=theoretical_quantiles,
            y=sample_quantiles,
            mode='markers',
            marker=dict(size=4, opacity=0.6),
            name=var,
            showlegend=False,
            hovertemplate=f"<b>{var}</b><br>Theoretical: %{{x:.2f}}<br>Sample: %{{y:.2f}}<extra></extra>"
        ),
        row=row, col=col
    )

    # Add reference line (perfect normal distribution)
    min_val = min(theoretical_quantiles.min(), sample_quantiles.min())
    max_val = max(theoretical_quantiles.max(), sample_quantiles.max())

    fig_qq.add_trace(
        go.Scatter(
            x=[min_val, max_val],
            y=[min_val, max_val],
            mode='lines',
            line=dict(color='red', dash='dash'),
            name='Perfect Normal',
            showlegend=(i == 0)
        ),
        row=row, col=col
    )

fig_qq.update_layout(
 title="Q-Q Plots: Normality Assessment",
 height=800
)
fig_qq.show()

# Display normality test results
print("Normality Test Results:")
print("=" * 26)

normality_df = pd.DataFrame(normality_results)
print("Variable | Shapiro-Wilk p | K-S p | Jarque-Bera p | Is Normal?")
print("-" * 65)

for _, row in normality_df.iterrows():
    var_name = row['variable'].replace('_', ' ').title()[:12].ljust(12)
    # pd.notna handles both None and NaN (pandas stores missing floats as NaN)
    shapiro_p = f"{row['shapiro_p']:.4f}" if pd.notna(row['shapiro_p']) else "N/A"
    ks_p = f"{row['ks_p']:.4f}"
    jb_p = f"{row['jb_p']:.4f}"
    is_normal = "Yes" if row['is_normal'] else "No"

    print(f"{var_name} | {shapiro_p:>13} | {ks_p:>5} | {jb_p:>11} | {is_normal:>9}")

print("\nNormality Summary:")
normal_vars = [r['variable'] for r in normality_results if r['is_normal']]
non_normal_vars = [r['variable'] for r in normality_results if not r['is_normal']]

print(f"• Normal distributions: {len(normal_vars)} variables")
if normal_vars:
    print(f" - {', '.join([v.replace('_', ' ').title() for v in normal_vars])}")

print(f"• Non-normal distributions: {len(non_normal_vars)} variables")
if non_normal_vars:
    print(f" - {', '.join([v.replace('_', ' ').title() for v in non_normal_vars])}")
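
A Q-Q plot can also be summarized by the correlation between ordered sample values and theoretical normal quantiles: values close to 1 indicate points lying near a straight line. The sketch below uses `scipy.stats.probplot`, which computes plotting positions with Filliben's estimate rather than the `i / (n + 1)` positions used in the notebook, so the two are close but not identical. The synthetic samples are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = {
    "normal": rng.normal(loc=75, scale=12, size=2000),
    "exponential": rng.exponential(scale=1.5, size=2000),
}

qq_correlation = {}
for label, sample in samples.items():
    # probplot returns the Q-Q coordinates and a least-squares line fit (slope, intercept, r)
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    qq_correlation[label] = r
    print(f"{label:<12} Q-Q correlation r = {r:.4f}")
```

The exponential sample bends away from the reference line in the upper tail, which lowers its correlation relative to the normal sample.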

In [None]:
# 4. DISTRIBUTION FITTING AND MODEL SELECTION
print("\n4. DISTRIBUTION FITTING AND MODEL SELECTION")
print("=" * 48)

def fit_multiple_distributions(data, distributions=None):
    """Fit multiple distributions and return the results sorted by AIC (best first)"""

    if distributions is None:
        distributions = [
            ('Normal', stats.norm),
            ('Log-Normal', stats.lognorm),
            ('Exponential', stats.expon),
            ('Gamma', stats.gamma),
            ('Beta', stats.beta),
            ('Uniform', stats.uniform)
        ]

    results = []

    for name, distribution in distributions:
        try:
            # Fit distribution
            if name == 'Beta':
                # Beta distribution needs data scaled to [0,1]
                data_range = data.max() - data.min()
                data_scaled = (data - data.min()) / data_range
                params = distribution.fit(data_scaled)
                fitted_data = data_scaled
                # Change of variables: the density on the original scale is the
                # scaled density divided by data_range, so subtract log(data_range)
                # per observation to keep AIC/BIC comparable across distributions
                log_jacobian = -len(data) * np.log(data_range)
            else:
                params = distribution.fit(data)
                fitted_data = data
                log_jacobian = 0.0

            # Calculate log-likelihood on the original data scale
            log_likelihood = np.sum(distribution.logpdf(fitted_data, *params)) + log_jacobian

            # Calculate AIC and BIC
            k = len(params)  # number of parameters
            n = len(data)  # sample size
            aic = 2 * k - 2 * log_likelihood
            bic = k * np.log(n) - 2 * log_likelihood

            # Kolmogorov-Smirnov test against the fitted distribution
            # (fitted_data is the scaled data for Beta, the raw data otherwise)
            ks_stat, ks_p = stats.kstest(fitted_data,
                                         lambda x: distribution.cdf(x, *params))

            results.append({
                'distribution': name,
                'parameters': params,
                'log_likelihood': log_likelihood,
                'aic': aic,
                'bic': bic,
                'ks_statistic': ks_stat,
                'ks_p_value': ks_p
            })

        except Exception as e:
            print(f"Failed to fit {name}: {e}")
            continue

    # Sort by AIC (lower is better)
    results.sort(key=lambda x: x['aic'])
    return results

# Fit distributions for key variables
variables_to_fit = ['sales_revenue', 'response_time', 'customer_satisfaction']

fitting_results = {}
for var in variables_to_fit:
    print(f"\nFitting distributions for {var.replace('_', ' ').title()}:")

    data = df[var].values
    # Remove any infinite or NaN values
    data = data[np.isfinite(data)]

    results = fit_multiple_distributions(data)
    fitting_results[var] = results

    print("Rank | Distribution | AIC | BIC | K-S p-value")
    print("-" * 50)

    for i, result in enumerate(results[:5]):  # Show top 5
        rank = i + 1
        dist_name = result['distribution'][:12].ljust(12)
        aic = f"{result['aic']:.1f}"
        bic = f"{result['bic']:.1f}"
        ks_p = f"{result['ks_p_value']:.4f}"

        print(f"{rank:>4} | {dist_name} | {aic:>7} | {bic:>7} | {ks_p:>11}")

# Visualize best-fitting distributions
fig_fits = make_subplots(
 rows=1, cols=3,
 subplot_titles=[var.replace('_', ' ').title() for var in variables_to_fit]
)

# Map display names to SciPy distribution objects
# (names such as 'Normal' or 'Log-Normal' do not match SciPy attribute names)
scipy_distributions = {
    'Normal': stats.norm,
    'Log-Normal': stats.lognorm,
    'Exponential': stats.expon,
    'Gamma': stats.gamma,
    'Beta': stats.beta,
    'Uniform': stats.uniform
}

for i, var in enumerate(variables_to_fit):
    col = i + 1
    data = df[var].values
    data = data[np.isfinite(data)]

    # Plot histogram
    fig_fits.add_trace(
        go.Histogram(
            x=data,
            nbinsx=30,
            histnorm='probability density',
            name=f'{var} Data',
            marker_color='lightblue',
            opacity=0.7,
            showlegend=(i == 0)
        ),
        row=1, col=col
    )

    # Plot best-fitting distribution
    best_fit = fitting_results[var][0]
    dist_name = best_fit['distribution']
    params = best_fit['parameters']
    distribution = scipy_distributions[dist_name]

    # Generate distribution curve
    if dist_name == 'Beta':
        # Beta was fitted on data scaled to [0,1]: evaluate there, then map back
        data_range = data.max() - data.min()
        x_scaled = np.linspace(0, 1, 200)
        # Divide by data_range so the curve is a density on the original scale
        y_range = distribution.pdf(x_scaled, *params) / data_range
        x_range = x_scaled * data_range + data.min()
    else:
        x_range = np.linspace(data.min(), data.max(), 200)
        y_range = distribution.pdf(x_range, *params)

    fig_fits.add_trace(
        go.Scatter(
            x=x_range,
            y=y_range,
            mode='lines',
            name=f'Best Fit: {dist_name}',
            line=dict(color='red', width=3),
            showlegend=(i == 0)
        ),
        row=1, col=col
    )

fig_fits.update_layout(
 title="Distribution Fitting: Best Models vs Actual Data",
 height=500
)
fig_fits.show()

# Summary of best fits
print("\nBest Distribution Fits Summary:")
print("=" * 35)

for var in variables_to_fit:
    best_fit = fitting_results[var][0]
    var_name = var.replace('_', ' ').title()
    dist_name = best_fit['distribution']
    aic = best_fit['aic']
    ks_p = best_fit['ks_p_value']

    goodness = "Excellent" if ks_p > 0.1 else "Good" if ks_p > 0.05 else "Fair" if ks_p > 0.01 else "Poor"

    print(f"• {var_name}: {dist_name} distribution")
    print(f" - AIC: {aic:.1f}")
    print(f" - Goodness of fit: {goodness} (p={ks_p:.4f})")
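
The AIC ranking above follows AIC = 2k - 2 ln L, where k is the number of estimated parameters and L the maximized likelihood. The sketch below applies it to a synthetic log-normal sample. The helper `aic_for` is hypothetical, and fixing the location with `floc=0` is an assumption that differs from the notebook, which leaves all SciPy parameters free.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.lognormal(mean=10.5, sigma=0.8, size=2000)

def aic_for(distribution, data, **fit_kwargs):
    """Fit by maximum likelihood and return AIC = 2k - 2 * log-likelihood."""
    params = distribution.fit(data, **fit_kwargs)
    log_likelihood = np.sum(distribution.logpdf(data, *params))
    # Parameters held fixed through f-prefixed keywords are not estimated,
    # so they are excluded from the parameter count k
    k = len(params) - len(fit_kwargs)
    return 2 * k - 2 * log_likelihood

aic_normal = aic_for(stats.norm, sample)
aic_lognormal = aic_for(stats.lognorm, sample, floc=0)  # location fixed at 0

print(f"Normal AIC:     {aic_normal:,.1f}")
print(f"Log-normal AIC: {aic_lognormal:,.1f}")
```

Lower AIC is better, so the log-normal model should win clearly on data generated from a log-normal. AIC values are only comparable when every likelihood is computed on the same data scale.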

In [None]:
# 5. GROUP COMPARISONS AND CATEGORICAL ANALYSIS
print("\n5. GROUP COMPARISONS AND CATEGORICAL ANALYSIS")
print("=" * 52)

# 5.1 Distribution by categorical variables
print("5.1 Regional Distribution Analysis:")

# Sales revenue by region
fig_region = go.Figure()

regions = df['region'].unique()
colors_region = ['red', 'blue', 'green', 'orange']

for i, region in enumerate(regions):
    region_data = df[df['region'] == region]['sales_revenue']

    fig_region.add_trace(
        go.Histogram(
            x=region_data,
            name=region,
            opacity=0.7,
            nbinsx=25,
            histnorm='probability density',
            marker_color=colors_region[i]
        )
    )

fig_region.update_layout(
 title="Sales Revenue Distribution by Region",
 xaxis_title="Sales Revenue ($)",
 yaxis_title="Probability Density",
 barmode='overlay',
 height=500
)
fig_region.show()

# 5.2 Business type comparison
print("\n5.2 Business Type Distribution Analysis:")

# Create side-by-side box plots
fig_business = go.Figure()

business_types = df['business_type'].unique()
colors_business = ['lightblue', 'lightgreen', 'lightcoral']

for i, btype in enumerate(business_types):
    btype_data = df[df['business_type'] == btype]['customer_satisfaction']

    fig_business.add_trace(
        go.Box(
            y=btype_data,
            name=btype,
            marker_color=colors_business[i],
            boxpoints='outliers'
        )
    )

fig_business.update_layout(
 title="Customer Satisfaction Distribution by Business Type",
 yaxis_title="Customer Satisfaction (0-10)",
 height=500
)
fig_business.show()

# 5.3 Statistical comparison tests
print("\n5.3 Statistical Comparison Tests:")

# ANOVA test for regional differences in sales
regional_groups = [df[df['region'] == region]['sales_revenue'] for region in regions]
f_stat, p_value = stats.f_oneway(*regional_groups)

print(f"Regional Sales Revenue ANOVA:")
print(f"• F-statistic: {f_stat:.3f}")
print(f"• P-value: {p_value:.6f}")
print(f"• Significant difference: {'Yes' if p_value < 0.05 else 'No'}")

# Post-hoc analysis (Tukey's HSD), run only when the ANOVA is significant
if p_value < 0.05:
    from scipy.stats import tukey_hsd

    # Pairwise comparisons between regions (same group order as `regions`)
    tukey_result = tukey_hsd(*regional_groups)

    print("\nTukey's HSD Post-hoc Analysis (pairwise p-values):")
    for a in range(len(regions)):
        for b in range(a + 1, len(regions)):
            print(f" {regions[a]} vs {regions[b]}: p={tukey_result.pvalue[a, b]:.4f}")

    regional_means = df.groupby('region')['sales_revenue'].mean().sort_values(ascending=False)
    print("Regional Sales Ranking (highest to lowest):")
    for i, (region, mean_sales) in enumerate(regional_means.items(), 1):
        print(f" {i}. {region}: ${mean_sales:,.0f}")

# Customer segment analysis
print(f"\n5.4 Customer Segment Analysis:")

# Compare customer lifetime value across segments
segments = df['customer_segment'].unique()
segment_stats = df.groupby('customer_segment')['customer_lifetime_value'].agg([
 'count', 'mean', 'std', 'median'
]).round(2)

print("Customer Lifetime Value by Segment:")
print(segment_stats)

# Create violin plot for segment comparison
fig_segments = go.Figure()

for i, segment in enumerate(segments):
    segment_data = df[df['customer_segment'] == segment]['customer_lifetime_value']

    fig_segments.add_trace(
        go.Violin(
            y=segment_data,
            name=segment,
            box_visible=True,
            meanline_visible=True,
            fillcolor=colors_business[i % len(colors_business)],
            opacity=0.6
        )
    )

fig_segments.update_layout(
 title="Customer Lifetime Value Distribution by Segment",
 yaxis_title="Customer Lifetime Value ($)",
 height=500
)
fig_segments.show()

# Kruskal-Wallis test (non-parametric alternative to ANOVA)
segment_groups = [df[df['customer_segment'] == seg]['customer_lifetime_value'] for seg in segments]
kw_stat, kw_p = stats.kruskal(*segment_groups)

print(f"\nCustomer Segment Lifetime Value Comparison (Kruskal-Wallis):")
print(f"• H-statistic: {kw_stat:.3f}")
print(f"• P-value: {kw_p:.6f}")
print(f"• Significant difference: {'Yes' if kw_p < 0.05 else 'No'}")
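
ANOVA compares group means and assumes roughly normal residuals, while Kruskal-Wallis compares rank distributions and makes no normality assumption. The sketch below runs both on three synthetic log-normal groups whose log-scale location is deliberately shifted; the group sizes and shifts are assumptions for illustration and do not come from the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Three skewed groups with increasing log-scale location (0.0, 0.3, 0.6 shift)
groups = [rng.lognormal(mean=10.5 + shift, sigma=0.8, size=500)
          for shift in (0.0, 0.3, 0.6)]

f_stat, p_anova = stats.f_oneway(*groups)  # parametric: compares means
h_stat, p_kw = stats.kruskal(*groups)      # rank-based: no normality assumption

print(f"ANOVA:          F={f_stat:.2f}, p={p_anova:.2e}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.2e}")
```

In the notebook itself, region and segment labels are drawn independently of the numeric columns, so non-significant results there are the expected outcome rather than a sign of a problem.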

In [None]:
# 6. BIMODAL AND MULTIMODAL DISTRIBUTION ANALYSIS
print("\n6. BIMODAL AND MULTIMODAL DISTRIBUTION ANALYSIS")
print("=" * 52)

# Analyze the bimodal customer lifetime value distribution
def detect_multimodality(data, method='dip'):
    """Detect multimodality with Hartigan's dip test (if installed) and KDE peak counting"""

    # Hartigan's dip test for unimodality
    try:
        from diptest import diptest
        dip_stat, dip_p = diptest(data)
        is_multimodal_dip = dip_p < 0.05
    except ImportError:
        print("diptest not available, using alternative method")
        dip_stat, dip_p = None, None
        is_multimodal_dip = None

    # Alternative: Count peaks in KDE
    from scipy.stats import gaussian_kde
    from scipy.signal import find_peaks

    kde = gaussian_kde(data)
    x_range = np.linspace(data.min(), data.max(), 1000)
    density = kde(x_range)

    # Find peaks
    peaks, _ = find_peaks(density, height=0.001, distance=50)
    n_peaks = len(peaks)

    return {
        'dip_statistic': dip_stat,
        'dip_p_value': dip_p,
        'is_multimodal_dip': is_multimodal_dip,
        'n_peaks': n_peaks,
        'is_multimodal_peaks': n_peaks > 1,
        # Always a NumPy array (empty when no peaks) so callers can rely on array methods
        'peak_locations': x_range[peaks]
    }

# Analyze customer lifetime value for multimodality
clv_data = df['customer_lifetime_value'].values
multimodal_results = detect_multimodality(clv_data)

print("Multimodality Analysis - Customer Lifetime Value:")
print(f"• Number of peaks detected: {multimodal_results['n_peaks']}")
print(f"• Is multimodal (peaks): {'Yes' if multimodal_results['is_multimodal_peaks'] else 'No'}")

if multimodal_results['dip_p_value'] is not None:
    print(f"• Dip test p-value: {multimodal_results['dip_p_value']:.6f}")
    print(f"• Is multimodal (dip test): {'Yes' if multimodal_results['is_multimodal_dip'] else 'No'}")

if len(multimodal_results['peak_locations']) > 0:
    print(f"• Peak locations: {np.round(multimodal_results['peak_locations'], 2)}")

# Visualize multimodal distribution
fig_multimodal = go.Figure()

# Histogram
fig_multimodal.add_trace(
 go.Histogram(
 x=clv_data,
 nbinsx=40,
 histnorm='probability density',
 name='Customer LTV',
 marker_color='lightblue',
 opacity=0.7
 )
)

# KDE overlay
from scipy.stats import gaussian_kde
kde = gaussian_kde(clv_data)
x_range = np.linspace(clv_data.min(), clv_data.max(), 1000)
density = kde(x_range)

fig_multimodal.add_trace(
 go.Scatter(
 x=x_range,
 y=density,
 mode='lines',
 name='KDE',
 line=dict(color='red', width=3)
 )
)

# Mark peaks
if len(multimodal_results['peak_locations']) > 0:
    peak_densities = kde(multimodal_results['peak_locations'])

    fig_multimodal.add_trace(
        go.Scatter(
            x=multimodal_results['peak_locations'],
            y=peak_densities,
            mode='markers',
            name='Detected Peaks',
            marker=dict(color='green', size=12, symbol='star')
        )
    )

fig_multimodal.update_layout(
 title="Multimodal Distribution Analysis: Customer Lifetime Value",
 xaxis_title="Customer Lifetime Value ($)",
 yaxis_title="Probability Density",
 height=500
)
fig_multimodal.show()

# Mixture model fitting for bimodal distribution
print("\n6.1 Gaussian Mixture Model Analysis:")

from sklearn.mixture import GaussianMixture

# Fit Gaussian Mixture Models with different numbers of components
n_components_range = range(1, 6)
aic_scores = []
bic_scores = []
log_likelihoods = []

clv_reshaped = clv_data.reshape(-1, 1)

for n_components in n_components_range:
    gmm = GaussianMixture(n_components=n_components, random_state=42)
    gmm.fit(clv_reshaped)

    aic_scores.append(gmm.aic(clv_reshaped))
    bic_scores.append(gmm.bic(clv_reshaped))
    # score() returns the average log-likelihood per sample
    log_likelihoods.append(gmm.score(clv_reshaped))

# Find optimal number of components
optimal_components_aic = n_components_range[np.argmin(aic_scores)]
optimal_components_bic = n_components_range[np.argmin(bic_scores)]

print(f"• Optimal components (AIC): {optimal_components_aic}")
print(f"• Optimal components (BIC): {optimal_components_bic}")

# Fit final model with optimal components
optimal_n = optimal_components_bic # BIC is more conservative
final_gmm = GaussianMixture(n_components=optimal_n, random_state=42)
final_gmm.fit(clv_reshaped)

# Get component parameters
print(f"\nGaussian Mixture Model Components (n={optimal_n}):")
for i in range(optimal_n):
    mean = final_gmm.means_[i][0]
    std = np.sqrt(final_gmm.covariances_[i][0][0])
    weight = final_gmm.weights_[i]

    print(f"• Component {i+1}: Mean={mean:.2f}, Std={std:.2f}, Weight={weight:.3f}")

# Predict component memberships
component_labels = final_gmm.predict(clv_reshaped)
component_probs = final_gmm.predict_proba(clv_reshaped)

# Visualize mixture components
fig_mixture = go.Figure()

# Original data
fig_mixture.add_trace(
 go.Histogram(
 x=clv_data,
 nbinsx=40,
 histnorm='probability density',
 name='Original Data',
 marker_color='lightgray',
 opacity=0.5
 )
)

# Plot each component
colors_comp = ['red', 'blue', 'green', 'purple', 'orange']
x_plot = np.linspace(clv_data.min(), clv_data.max(), 1000)

for i in range(optimal_n):
    mean = final_gmm.means_[i][0]
    std = np.sqrt(final_gmm.covariances_[i][0][0])
    weight = final_gmm.weights_[i]

    # Weighted component density, so the components sum to the mixture density
    component_density = weight * stats.norm.pdf(x_plot, mean, std)

    fig_mixture.add_trace(
        go.Scatter(
            x=x_plot,
            y=component_density,
            mode='lines',
            name=f'Component {i+1}',
            line=dict(color=colors_comp[i], width=2, dash='dash')
        )
    )

# Overall mixture
mixture_density = np.sum([
 final_gmm.weights_[i] * stats.norm.pdf(x_plot, final_gmm.means_[i][0],
 np.sqrt(final_gmm.covariances_[i][0][0]))
 for i in range(optimal_n)
], axis=0)

fig_mixture.add_trace(
 go.Scatter(
 x=x_plot,
 y=mixture_density,
 mode='lines',
 name='Mixture Model',
 line=dict(color='black', width=3)
 )
)

fig_mixture.update_layout(
 title=f"Gaussian Mixture Model: {optimal_n} Components",
 xaxis_title="Customer Lifetime Value ($)",
 yaxis_title="Probability Density",
 height=500
)
fig_mixture.show()

# Business interpretation of components
if optimal_n > 1:
    print("\nBusiness Interpretation:")
    component_means = [final_gmm.means_[i][0] for i in range(optimal_n)]
    sorted_components = np.argsort(component_means)

    component_names = ['Low-Value', 'Mid-Value', 'High-Value', 'Premium', 'Ultra-Premium'][:optimal_n]

    for i, comp_idx in enumerate(sorted_components):
        mean_val = final_gmm.means_[comp_idx][0]
        weight = final_gmm.weights_[comp_idx]
        n_customers = int(weight * len(df))

        print(f"• {component_names[i]} Customers: {n_customers:,} customers ({weight:.1%})")
        print(f" - Average LTV: ${mean_val:.0f}")
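
The peak-counting fallback used for multimodality detection can be tried in isolation on a sample built the same way as `customer_lifetime_value` (two normal components centered at 30 and 70). The helper `count_kde_peaks` and its relative height threshold are hypothetical; the notebook uses an absolute height of 0.001 and a minimum peak distance instead.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

def count_kde_peaks(values, grid_size=1000, min_relative_height=0.05):
    """Return the locations of KDE peaks at least 5% as high as the tallest one."""
    values = np.asarray(values, dtype=float)
    grid = np.linspace(values.min(), values.max(), grid_size)
    density = gaussian_kde(values)(grid)
    peaks, _ = find_peaks(density, height=min_relative_height * density.max())
    return grid[peaks]

rng = np.random.default_rng(5)
bimodal = np.concatenate([rng.normal(30, 8, 1000), rng.normal(70, 10, 1000)])

peak_locations = count_kde_peaks(bimodal)
print(f"Peaks found: {len(peak_locations)} at {np.round(peak_locations, 1)}")
```

A relative threshold scales with the data, whereas a fixed absolute height depends on the units of the variable. Peak counts are also sensitive to the KDE bandwidth, so they work best as a complement to a formal test or a mixture-model comparison.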

In [None]:
# 7. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS
print("\n7. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS")
print("=" * 58)

def generate_distribution_insights(dataframe, fitting_results, multimodal_results):
    """Generate business insights from the distribution analysis results"""

    insights = {
        'distribution_characteristics': [],
        'risk_assessment': [],
        'optimization_opportunities': [],
        'customer_segmentation': [],
        'operational_insights': []
    }

    # Distribution characteristics insights
    sales_cv = dataframe['sales_revenue'].std() / dataframe['sales_revenue'].mean()

    insights['distribution_characteristics'].append(
        f"Sales revenue shows high variability (CV={sales_cv:.2f}) indicating diverse business performance"
    )

    # Customer satisfaction analysis
    satisfaction_skew = stats.skew(dataframe['customer_satisfaction'])
    if satisfaction_skew > 0.5:
        # Right skew: the bulk of scores is low, with a thin tail of high scores
        insights['optimization_opportunities'].append(
            "Customer satisfaction is right-skewed - most customers report low scores, "
            "so broad improvement is needed"
        )
    elif satisfaction_skew < -0.5:
        # Left skew: the bulk of scores is high, with a thin tail of low scores
        insights['optimization_opportunities'].append(
            "Customer satisfaction is left-skewed - most customers are highly satisfied; "
            "focus on the low-satisfaction tail"
        )

    # Response time analysis
    if 'response_time' in fitting_results:
        best_fit = fitting_results['response_time'][0]
        if best_fit['distribution'] == 'Exponential':
            insights['operational_insights'].append(
                "Response time follows exponential distribution - typical of system performance metrics"
            )

    # Multimodal customer analysis
    if multimodal_results['is_multimodal_peaks']:
        insights['customer_segmentation'].append(
            f"Customer lifetime value shows {multimodal_results['n_peaks']} distinct segments - "
            "strong evidence for targeted marketing strategies"
        )

    # Risk assessment: share of total revenue generated by the top 5% of records
    revenue_threshold = dataframe['sales_revenue'].quantile(0.95)
    top_revenue = dataframe.loc[dataframe['sales_revenue'] > revenue_threshold, 'sales_revenue'].sum()
    top_revenue_share = top_revenue / dataframe['sales_revenue'].sum()

    insights['risk_assessment'].append(
        f"Top 5% of businesses generate {top_revenue_share:.1%} of total revenue - "
        "concentration risk in a small part of the customer base"
    )

    return insights

# Generate insights
distribution_insights = generate_distribution_insights(df, fitting_results, multimodal_results)

# Create comprehensive business dashboard
fig_dashboard = make_subplots(
 rows=2, cols=2,
 subplot_titles=(
 "Revenue Distribution Risk Profile",
 "Customer Satisfaction Performance",
 "Operational Efficiency Metrics",
 "Customer Segment Value Analysis"
 ),
 specs=[[{"type": "scatter"}, {"type": "bar"}],
 [{"type": "bar"}, {"type": "bar"}]]
)

# 1. Revenue Distribution Risk Profile
revenue_percentiles = np.percentile(df['sales_revenue'], [10, 25, 50, 75, 90, 95, 99])
percentile_labels = ['P10', 'P25', 'P50', 'P75', 'P90', 'P95', 'P99']

fig_dashboard.add_trace(
 go.Scatter(
 x=percentile_labels,
 y=revenue_percentiles,
 mode='markers+lines',
 marker=dict(size=10, color='red'),
 line=dict(color='red', width=3),
 name='Revenue Percentiles'
 ),
 row=1, col=1
)

# 2. Customer Satisfaction Performance
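satisfaction_ranges = pd.cut(df['customer_satisfaction'],
                             bins=[0, 5, 7, 8.5, 10],
                             labels=['Poor', 'Fair', 'Good', 'Excellent'],
                             include_lowest=True)
# Keep counts in category order (Poor -> Excellent) so bars line up with marker colors
satisfaction_counts = satisfaction_ranges.value_counts().sort_index()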

fig_dashboard.add_trace(
 go.Bar(
 x=satisfaction_counts.index,
 y=satisfaction_counts.values,
 marker_color=['red', 'orange', 'lightgreen', 'darkgreen'],
 name='Satisfaction Distribution'
 ),
 row=1, col=2
)

# 3. Operational Efficiency Metrics
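response_time_bins = pd.cut(df['response_time'],
                            bins=[0, 1, 2, 3, np.inf],
                            labels=['Excellent (<1s)', 'Good (1-2s)', 'Fair (2-3s)', 'Poor (>3s)'],
                            include_lowest=True)
# Keep counts in category order (Excellent -> Poor) so bars line up with marker colors
response_counts = response_time_bins.value_counts().sort_index()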

fig_dashboard.add_trace(
 go.Bar(
 x=response_counts.index,
 y=response_counts.values,
 marker_color=['darkgreen', 'lightgreen', 'orange', 'red'],
 name='Response Time Distribution'
 ),
 row=2, col=1
)

# 4. Customer Segment Value Analysis
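if multimodal_results['n_peaks'] > 1:
    # Assign each customer to a mixture component. GMM component indices are
    # arbitrary, so order the bars by component mean (lowest LTV first).
    component_labels = final_gmm.predict(df['customer_lifetime_value'].values.reshape(-1, 1))
    components_by_mean = np.argsort(final_gmm.means_[:, 0])
    component_counts = (pd.Series(component_labels).value_counts()
                        .reindex(components_by_mean, fill_value=0))
    component_names = [f'Segment {i+1}' for i in range(len(component_counts))]
    segment_palette = ['lightblue', 'skyblue', 'blue', 'darkblue', 'navy']

    fig_dashboard.add_trace(
        go.Bar(
            x=component_names,
            y=component_counts.values,
            marker_color=segment_palette[:len(component_counts)],
            name='Customer Segments'
        ),
        row=2, col=2
    )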

fig_dashboard.update_layout(
 title="Distribution Analysis Business Dashboard",
 height=700,
 showlegend=False
)
fig_dashboard.show()

# Display strategic insights
print(" Strategic Business Insights:")
print("=" * 30)

for category, insights_list in distribution_insights.items():
    if insights_list:
        print(f"\n{category.replace('_', ' ').title()}:")
        for i, insight in enumerate(insights_list, 1):
            print(f" {i}. {insight}")

# Risk and opportunity assessment
print(f"\n Risk and Opportunity Assessment:")
print("=" * 35)

# Revenue concentration analysis
revenue_80th = df['sales_revenue'].quantile(0.8)
top_20_percent = df[df['sales_revenue'] >= revenue_80th]
revenue_concentration = top_20_percent['sales_revenue'].sum() / df['sales_revenue'].sum()

print(f"• Revenue Concentration: Top 20% of businesses generate {revenue_concentration:.1%} of total revenue")

# Customer satisfaction risk
low_satisfaction = (df['customer_satisfaction'] < 6).mean()
print(f"• Customer Risk: {low_satisfaction:.1%} of customers have satisfaction scores below 6/10")

# Operational efficiency
slow_responses = (df['response_time'] > 3).mean()
print(f"• Operational Risk: {slow_responses:.1%} of responses exceed 3-second threshold")

# Growth opportunities
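if multimodal_results['n_peaks'] > 1:
    # GMM component indices are arbitrary, so identify the lowest-value
    # segment by its component mean rather than assuming it is component 0
    lowest_component = int(np.argmin(final_gmm.means_[:, 0]))
    low_value_segment = (component_labels == lowest_component).mean()
    print(f"• Growth Opportunity: {low_value_segment:.1%} of customers in low-value segment - potential for upselling")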

print(f"\n Strategic Recommendations:")
print("1. Implement customer success program for low-satisfaction segments")
print("2. Develop premium services for high-value customer segments")
print("3. Optimize system performance to reduce response time variability")
print("4. Create targeted marketing campaigns based on identified customer segments")
print("5. Monitor revenue concentration to reduce dependency on top customers")

# LEARNING SUMMARY: Distribution Analysis

## Key Concepts Mastered

### 1. **Distribution Fundamentals**
- **Shape Analysis**: Understanding skewness, kurtosis, and distribution characteristics
- **Central Tendency**: Mean vs median relationships in different distributions
- **Variability Measures**: Standard deviation, coefficient of variation, and range analysis
- **Outlier Detection**: Statistical methods for identifying extreme values
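
The shape, central-tendency, variability, and outlier measures listed above can be sketched on a small synthetic sample. This is a minimal illustration rather than the notebook's own code: the lognormal sample, the `summarize_shape` helper, and the 1.5 x IQR outlier rule are all assumptions chosen for the example.

```python
import numpy as np
from scipy import stats


def summarize_shape(values):
    """Return basic shape, spread, and outlier statistics for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
    outlier_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return {
        "mean": values.mean(),
        "median": np.median(values),
        "skewness": stats.skew(values),
        "excess_kurtosis": stats.kurtosis(values),  # 0 for a normal distribution
        "cv": values.std(ddof=1) / values.mean(),
        "outlier_share": outlier_mask.mean(),
    }


# Hypothetical right-skewed sample standing in for a revenue-like column
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=10.0, sigma=0.8, size=5_000)

summary = summarize_shape(sample)
for name, value in summary.items():
    print(f"{name:>16}: {value:,.3f}")
```

For a right-skewed sample like this one, the mean sits above the median and the skewness is positive, which is the pattern to look for before choosing between mean-based and median-based summaries.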

### 2. **Statistical Testing**
- **Normality Tests**: Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera, Anderson-Darling
- **Q-Q Plots**: Visual assessment of distribution fit against theoretical distributions
- **Distribution Fitting**: AIC/BIC model selection for optimal distribution choice
- **Goodness of Fit**: Statistical validation of distribution models
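
As a rough sketch of how these tests and AIC/BIC model selection fit together, the example below runs two normality tests, a Q-Q correlation, and a small candidate comparison with SciPy. The exponential sample, the candidate set, the `score_fit` helper, and the choice to pin the location at zero are assumptions made for illustration; they are not taken from the notebook's fitting code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical waiting-time sample (exponential with mean 1.5); not the notebook's data
sample = rng.exponential(scale=1.5, size=2_000)

# Normality tests: small p-values are evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(sample)
jb_stat, jb_p = stats.jarque_bera(sample)
# Q-Q correlation against a normal: values near 1 indicate a close fit
_, (_, _, qq_r) = stats.probplot(sample, dist="norm")
print(f"Shapiro-Wilk p={shapiro_p:.3g}  Jarque-Bera p={jb_p:.3g}  Q-Q r={qq_r:.3f}")


def score_fit(dist, data, **fixed):
    """Fit `dist` by maximum likelihood and return its AIC and BIC."""
    params = dist.fit(data, **fixed)
    log_lik = np.sum(dist.logpdf(data, *params))
    k = len(params) - len(fixed)  # fixed parameters (e.g. floc=0) are not estimated
    return {"aic": 2 * k - 2 * log_lik, "bic": k * np.log(len(data)) - 2 * log_lik}


# Candidate set is an assumption; location is pinned at 0 for the positive-only families
scores = {
    "norm": score_fit(stats.norm, sample),
    "expon": score_fit(stats.expon, sample, floc=0),
    "gamma": score_fit(stats.gamma, sample, floc=0),
    "lognorm": score_fit(stats.lognorm, sample, floc=0),
}
for name, s in sorted(scores.items(), key=lambda item: item[1]["aic"]):
    print(f"{name:>8}: AIC={s['aic']:.1f}  BIC={s['bic']:.1f}")
```

Lower AIC or BIC indicates a better trade-off between fit and complexity. Closely related families (here exponential and gamma) can score within a few points of each other, so a small gap is not strong evidence for one over the other.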

### 3. **Advanced Distribution Analysis**
- **Multimodal Detection**: Identifying multiple peaks and customer segments
- **Mixture Models**: Gaussian mixture modeling for complex distributions
- **Group Comparisons**: ANOVA, Kruskal-Wallis for categorical analysis
- **Business Intelligence**: Converting statistical insights into actionable strategies
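
The mixture-model step can be sketched as follows, assuming scikit-learn's `GaussianMixture` and BIC as the selection criterion. The three-segment synthetic sample and the variable names are hypothetical. The point of the example is that component indices come back in arbitrary order, so segments should be ranked by their fitted means before they are given names such as "low-value".

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Hypothetical customer-lifetime-value sample with three well-separated segments
ltv = np.concatenate([
    rng.normal(500, 80, 600),
    rng.normal(2_000, 250, 300),
    rng.normal(6_000, 600, 100),
]).reshape(-1, 1)

# Fit 1..5 components and keep the model with the lowest BIC
candidates = [
    GaussianMixture(n_components=k, random_state=0, n_init=3).fit(ltv)
    for k in range(1, 6)
]
best_gmm = min(candidates, key=lambda model: model.bic(ltv))

# Component indices are arbitrary, so rank them by mean before naming segments
order = np.argsort(best_gmm.means_[:, 0])
rank_of_component = np.empty_like(order)
rank_of_component[order] = np.arange(len(order))
segment_rank = rank_of_component[best_gmm.predict(ltv)]  # 0 = lowest-value segment

print(f"Selected components: {best_gmm.n_components}")
for rank, comp in enumerate(order):
    print(f"Segment {rank + 1}: mean LTV ~ {best_gmm.means_[comp, 0]:,.0f}, "
          f"weight {best_gmm.weights_[comp]:.1%}, "
          f"assigned {np.mean(segment_rank == rank):.1%}")
```

The mixture weight and the share of customers actually assigned to a component usually differ slightly, because assignment picks the most probable component for each point while the weight is a model parameter.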

## Business Applications

### Risk Assessment
- **Revenue Concentration**: Understanding dependency on high-value customers
- **Performance Variability**: Identifying operational inconsistencies
- **Customer Risk**: Quantifying satisfaction and retention risks
- **Quality Control**: Using distribution analysis for process monitoring
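
Revenue concentration can be quantified in more than one way. The sketch below computes the share of revenue held by the top fraction of accounts alongside a Gini coefficient. Both helpers (`top_share`, `gini`) and the lognormal sample are hypothetical illustrations, not functions defined elsewhere in this notebook.

```python
import numpy as np


def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of values."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]
    n_top = max(1, int(round(top_fraction * len(ordered))))
    return ordered[:n_top].sum() / ordered.sum()


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    cumulative = np.cumsum(ordered)
    return (n + 1 - 2 * cumulative.sum() / cumulative[-1]) / n


# Hypothetical skewed revenue sample
rng = np.random.default_rng(3)
revenue = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)

print(f"Top 20% revenue share: {top_share(revenue, 0.2):.1%}")
print(f"Top 5% revenue share:  {top_share(revenue, 0.05):.1%}")
print(f"Gini coefficient:      {gini(revenue):.3f}")
```

A top-20% share well above 20%, or a Gini coefficient far from zero, signals dependency on a small group of accounts. Which threshold counts as risky is a business judgment rather than a statistical one.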

### Strategic Planning
Distribution analysis enables:
- Customer segmentation through multimodal analysis
- Resource allocation based on distribution characteristics
- Performance benchmarking against theoretical models
- Risk mitigation through outlier identification

## Next Steps

1. **Tier 2: Predictive Modeling** - Use distribution insights for forecasting
2. **Hypothesis Testing** - Build on distribution knowledge for statistical inference
3. **Time Series Analysis** - Understand how distributions evolve over time
4. **Multivariate Analysis** - Extend to joint distributions and correlations

## Pro Tips

- Always visualize distributions before applying statistical tests
- Use multiple methods to confirm distribution characteristics
- Consider business context when interpreting statistical results
- Multimodal distributions often reveal hidden customer segments
- Non-normal distributions may require specialized analytical approaches

## Common Pitfalls

- **Assumption Violations**: Applying normal-distribution methods to non-normal data
- **Sample Size Effects**: Small samples can lead to unreliable distribution fitting
- **Outlier Influence**: Extreme values can distort distribution shape analysis
- **Over-interpretation**: Statistical significance doesn't always mean practical significance
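
The over-interpretation pitfall is easiest to see with large samples, where a negligible difference can still produce a very small p-value. The sketch below uses two hypothetical groups whose true means differ by 0.3 on a scale with standard deviation 15, and reports Cohen's d next to the Welch t-test result. The data and group sizes are assumptions chosen to make the contrast visible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two hypothetical groups with a tiny true difference in means (0.3 vs SD of 15)
group_a = rng.normal(100.0, 15.0, 200_000)
group_b = rng.normal(100.3, 15.0, 200_000)

# Welch's t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p-value:   {p_value:.2e}")
print(f"Cohen's d: {cohens_d:.3f}")
```

Reporting an effect size next to the p-value keeps the focus on whether a difference is large enough to act on, not only on whether it is detectable.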

**Remember**: *Distribution analysis is the foundation for understanding your data's story - let the shape guide your analytical strategy!*