# Week 7: Advanced Statistics for Big Data

## Overview
Master statistical analysis techniques for large-scale marketing data, including hypothesis testing, Bayesian methods, and time series analysis at scale.

## Learning Objectives
- Perform statistical tests on massive samples
- Handle challenges of large sample sizes
- Implement bootstrap and resampling at scale
- Conduct power analysis for large experiments
- Apply Bayesian statistics to marketing problems
- Analyze time series data at scale

## Prerequisites
- Strong statistics foundation
- Redshift cluster access
- Large marketing dataset (10M+ rows)
- Understanding of hypothesis testing

## Table of Contents
1. [Setup and Environment](#setup)
2. [Statistical Tests on Large Samples](#large-samples)
3. [Handling Massive Sample Sizes](#massive-samples)
4. [Bootstrap and Resampling at Scale](#bootstrap)
5. [Power Analysis for Large Experiments](#power-analysis)
6. [Bayesian Statistics for Marketing](#bayesian)
7. [Time Series Analysis at Scale](#timeseries)
8. [Real-World Project: 50M Conversion Analysis](#project)
9. [Exercises](#exercises)

## 1. Setup and Environment <a name="setup"></a>

In [None]:
# Install required packages
!pip install -q scipy statsmodels scikit-learn
!pip install -q pymc3 arviz  # Bayesian analysis
!pip install -q prophet  # Time series
!pip install -q pandas numpy redshift_connector
!pip install -q plotly seaborn matplotlib

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import norm, beta, gamma, ttest_ind, chi2_contingency
import statsmodels.api as sm
from statsmodels.stats.power import TTestIndPower, TTestPower
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, acf, pacf
import pymc3 as pm
import arviz as az
from prophet import Prophet
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import redshift_connector
from sklearn.utils import resample
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Libraries imported successfully")

### Redshift Connection

In [None]:
import os
from getpass import getpass

class RedshiftStatsAnalyzer:
    """
    Statistical analysis framework for Redshift data
    """
    
    def __init__(self, host, port, database, user, password):
        self.config = {
            'host': host,
            'port': int(port),
            'database': database,
            'user': user,
            'password': password
        }
        self.conn = None
    
    def connect(self):
        """Establish connection"""
        self.conn = redshift_connector.connect(**self.config)
        print(f"✓ Connected to {self.config['database']}")
        return self.conn
    
    def query(self, sql):
        """Execute query and return DataFrame"""
        if not self.conn:
            self.connect()
        
        cursor = self.conn.cursor()
        cursor.execute(sql)
        
        result = cursor.fetchall()
        columns = [desc[0] for desc in cursor.description]
        
        return pd.DataFrame(result, columns=columns)
    
    def get_sample_for_analysis(self, table, size=100000, stratify_col=None):
        """
        Get representative sample for statistical analysis
        """
        if stratify_col:
            # Stratified sampling
            query = f"""
            WITH strata AS (
                SELECT 
                    {stratify_col},
                    COUNT(*) as cnt,
                    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as pct
                FROM {table}
                GROUP BY {stratify_col}
            )
            SELECT t.*
            FROM {table} t
            JOIN strata s ON t.{stratify_col} = s.{stratify_col}
            WHERE RANDOM() < (s.pct / 100.0) * {size} / s.cnt
            LIMIT {size}
            """
        else:
            # Simple random sampling
            query = f"""
            SELECT *
            FROM {table}
            ORDER BY RANDOM()
            LIMIT {size}
            """
        
        return self.query(query)

# Initialize (update with your credentials)
REDSHIFT_HOST = os.getenv('REDSHIFT_HOST', 'your-cluster.redshift.amazonaws.com')
REDSHIFT_PORT = os.getenv('REDSHIFT_PORT', '5439')
REDSHIFT_DB = os.getenv('REDSHIFT_DB', 'marketing_db')
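# Prompt only when the environment variable is missing (a default passed to
# os.getenv is evaluated eagerly, so the prompt would otherwise always appear)
REDSHIFT_USER = os.getenv('REDSHIFT_USER') or input('Redshift username: ')
REDSHIFT_PASSWORD = os.getenv('REDSHIFT_PASSWORD') or getpass('Redshift password: ')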

rs_stats = RedshiftStatsAnalyzer(
    REDSHIFT_HOST, REDSHIFT_PORT, REDSHIFT_DB,
    REDSHIFT_USER, REDSHIFT_PASSWORD
)

## 2. Statistical Tests on Large Samples <a name="large-samples"></a>

### Challenge: Everything is Significant

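With large samples (100k+), even tiny effects become statistically significant, because the standard error of an estimate shrinks in proportion to 1/√n. We need to focus on **practical significance** (effect size and its confidence interval) rather than on statistical significance (p-value) alone.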
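
To make the point concrete, the sketch below computes the smallest mean difference that would reach p < 0.05 at a given per-group sample size. It assumes a two-sided z-test, equal group sizes, and a common standard deviation of 15 (the value used in the demonstration in this section). `smallest_significant_diff` is an illustrative helper, not a library function.

```python
import numpy as np
from scipy.stats import norm


def smallest_significant_diff(n_per_group, std=15.0, alpha=0.05):
    """Smallest absolute mean difference that reaches p < alpha.

    Assumes a two-sided z-test, equal group sizes, and a common
    standard deviation in both groups.
    """
    z_crit = norm.ppf(1 - alpha / 2)
    se_diff = std * np.sqrt(2.0 / n_per_group)
    return z_crit * se_diff


for n in [100, 10_000, 1_000_000]:
    diff = smallest_significant_diff(n)
    print(f"n={n:>9,}: |diff| >= {diff:.4f}  (Cohen's d >= {diff / 15.0:.4f})")
```

Each 100-fold increase in sample size shrinks the detectable difference by a factor of 10, which is why a fixed p-value threshold says little about business relevance at this scale.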

In [None]:
def demonstrate_large_sample_problem():
    """
    Demonstrate how large samples make everything significant
    """
    
    results = []
    
    # Test with increasing sample sizes
    for n in [100, 1000, 10000, 100000, 1000000]:
        # Group A: mean = 100, std = 15
        # Group B: mean = 100.5, std = 15 (tiny 0.5 difference!)
        group_a = np.random.normal(100, 15, n)
        group_b = np.random.normal(100.5, 15, n)
        
        # T-test
        t_stat, p_value = ttest_ind(group_a, group_b)
        
        # Effect size (Cohen's d), using sample standard deviations (ddof=1)
        cohens_d = (group_b.mean() - group_a.mean()) / np.sqrt(
            (group_a.std(ddof=1)**2 + group_b.std(ddof=1)**2) / 2
        )
        
        results.append({
            'sample_size': n,
            'mean_diff': group_b.mean() - group_a.mean(),
            'p_value': p_value,
            'cohens_d': cohens_d,
            'significant': p_value < 0.05
        })
    
    df_results = pd.DataFrame(results)
    
    print("Impact of Sample Size on Statistical Significance")
    print("="*60)
    print(df_results.to_string(index=False))
    print("\nNote: Same tiny effect (0.5 difference), but significance")
    print("depends on sample size, not practical importance!")
    
    return df_results

# demonstrate_large_sample_problem()

### Solution: Effect Size + Confidence Intervals

In [None]:
class LargeSampleTesting:
    """
    Statistical testing framework for large samples
    Emphasizes effect size and practical significance
    """
    
    @staticmethod
    def two_sample_test(group_a, group_b, metric_name="metric", 
                        min_effect_size=0.02):
        """
        Complete two-sample test with effect size analysis
        
        Args:
            group_a, group_b: Arrays of values
            metric_name: Name of metric being tested
            min_effect_size: Minimum practical effect size (Cohen's d)
        """
        
        # Basic statistics
        n_a, n_b = len(group_a), len(group_b)
        mean_a, mean_b = np.mean(group_a), np.mean(group_b)
        std_a, std_b = np.std(group_a, ddof=1), np.std(group_b, ddof=1)
        
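        # Statistical test (Welch's t-test: does not assume equal variances,
        # matching the unpooled standard error used for the interval below)
        t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)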
        
        # Effect size (Cohen's d)
        pooled_std = np.sqrt((std_a**2 + std_b**2) / 2)
        cohens_d = (mean_b - mean_a) / pooled_std
        
        # Confidence interval for mean difference
        se_diff = np.sqrt(std_a**2/n_a + std_b**2/n_b)
        ci_95 = stats.t.interval(
            0.95, 
            n_a + n_b - 2,
            loc=mean_b - mean_a,
            scale=se_diff
        )
        
        # Relative change
        rel_change = (mean_b - mean_a) / mean_a * 100
        
        # Interpret effect size
        if abs(cohens_d) < 0.2:
            effect_interpretation = "negligible"
        elif abs(cohens_d) < 0.5:
            effect_interpretation = "small"
        elif abs(cohens_d) < 0.8:
            effect_interpretation = "medium"
        else:
            effect_interpretation = "large"
        
        # Practical significance
        practically_significant = abs(cohens_d) >= min_effect_size
        
        results = {
            'metric': metric_name,
            'n_a': n_a,
            'n_b': n_b,
            'mean_a': mean_a,
            'mean_b': mean_b,
            'mean_diff': mean_b - mean_a,
            'rel_change_pct': rel_change,
            'ci_95_lower': ci_95[0],
            'ci_95_upper': ci_95[1],
            'p_value': p_value,
            'cohens_d': cohens_d,
            'effect_interpretation': effect_interpretation,
            'statistically_significant': p_value < 0.05,
            'practically_significant': practically_significant
        }
        
        return results
    
    @staticmethod
    def print_test_results(results):
        """Pretty print test results"""
        print(f"\n{'='*70}")
        print(f"Two-Sample Test Results: {results['metric']}")
        print(f"{'='*70}")
        print(f"Sample sizes: A={results['n_a']:,}, B={results['n_b']:,}")
        print(f"\nMeans:")
        print(f"  Group A: {results['mean_a']:.4f}")
        print(f"  Group B: {results['mean_b']:.4f}")
        print(f"  Difference: {results['mean_diff']:.4f} ({results['rel_change_pct']:.2f}%)")
        print(f"  95% CI: [{results['ci_95_lower']:.4f}, {results['ci_95_upper']:.4f}]")
        print(f"\nStatistical Significance:")
        print(f"  p-value: {results['p_value']:.6f}")
        print(f"  Significant (α=0.05): {results['statistically_significant']}")
        print(f"\nEffect Size:")
        print(f"  Cohen's d: {results['cohens_d']:.4f}")
        print(f"  Interpretation: {results['effect_interpretation']}")
        print(f"  Practically Significant: {results['practically_significant']}")
        print(f"{'='*70}\n")
    
    @staticmethod
    def proportion_test(successes_a, n_a, successes_b, n_b, 
                       metric_name="conversion rate"):
        """
        Two-proportion z-test with effect size
        Used for conversion rates, click-through rates, etc.
        """
        
        # Proportions
        p_a = successes_a / n_a
        p_b = successes_b / n_b
        
        # Pooled proportion
        p_pooled = (successes_a + successes_b) / (n_a + n_b)
        
        # Standard error
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
        
        # Z-statistic
        z_stat = (p_b - p_a) / se
        # norm.sf keeps precision in the far tail, where 1 - cdf rounds to 0
        p_value = 2 * norm.sf(abs(z_stat))
        
        # Effect size (h - Cohen's h for proportions)
        cohens_h = 2 * (np.arcsin(np.sqrt(p_b)) - np.arcsin(np.sqrt(p_a)))
        
        # Confidence interval
        se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
        ci_95 = (
            (p_b - p_a) - 1.96 * se_diff,
            (p_b - p_a) + 1.96 * se_diff
        )
        
        # Relative lift
        relative_lift = (p_b - p_a) / p_a * 100
        
        return {
            'metric': metric_name,
            'n_a': n_a,
            'n_b': n_b,
            'p_a': p_a,
            'p_b': p_b,
            'diff': p_b - p_a,
            'relative_lift_pct': relative_lift,
            'ci_95': ci_95,
            'z_stat': z_stat,
            'p_value': p_value,
            'cohens_h': cohens_h,
            'significant': p_value < 0.05
        }

# Example usage
# np.random.seed(42)
# group_a = np.random.normal(100, 15, 100000)
# group_b = np.random.normal(102, 15, 100000)  # 2% lift
# 
# tester = LargeSampleTesting()
# results = tester.two_sample_test(group_a, group_b, "Revenue per User", min_effect_size=0.1)
# tester.print_test_results(results)
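
One way to turn "effect size plus confidence interval" into a decision rule is to compare the whole interval against a minimum difference that matters to the business. The sketch below is an assumption-laden illustration, not part of the class above: `classify_effect` and `min_diff` are hypothetical names, the interval uses a normal approximation (reasonable at these sample sizes), and the threshold of 2.0 is arbitrary.

```python
import numpy as np
from scipy import stats


def classify_effect(group_a, group_b, min_diff, confidence=0.95):
    """Compare the CI for the mean difference (B - A) with a practical threshold."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)

    diff = b.mean() - a.mean()
    # Unpooled standard error of the difference in means
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = stats.norm.ppf(0.5 + confidence / 2)
    lower, upper = diff - z * se, diff + z * se

    if lower >= min_diff or upper <= -min_diff:
        verdict = "practically significant"   # whole CI beyond the threshold
    elif -min_diff < lower and upper < min_diff:
        verdict = "negligible"                # whole CI inside the threshold
    else:
        verdict = "inconclusive"              # CI straddles the threshold
    return {"diff": diff, "ci": (lower, upper), "verdict": verdict}


rng = np.random.default_rng(42)
a = rng.normal(100.0, 15.0, 200_000)
b = rng.normal(100.5, 15.0, 200_000)   # true difference of 0.5

result = classify_effect(a, b, min_diff=2.0)
print(result["verdict"], result["ci"])
```

Here the difference is statistically detectable, yet the interval sits well inside the ±2.0 band, so the rule reports it as negligible.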

## 3. Handling Massive Sample Sizes <a name="massive-samples"></a>

In [None]:
def statistical_test_in_redshift(rs_conn, table, metric_col, group_col, 
                                  group_a_val, group_b_val):
    """
    Perform statistical test directly in Redshift
    Avoids loading millions of rows into memory
    """
    
    query = f"""
    WITH group_stats AS (
        SELECT 
            {group_col} as group_name,
            COUNT(*) as n,
            AVG({metric_col}) as mean,
            STDDEV({metric_col}) as std,
            VARIANCE({metric_col}) as var
        FROM {table}
        WHERE {group_col} IN ('{group_a_val}', '{group_b_val}')
            AND {metric_col} IS NOT NULL
        GROUP BY {group_col}
    ),
    test_stats AS (
        SELECT 
            MAX(CASE WHEN group_name = '{group_a_val}' THEN n END) as n_a,
            MAX(CASE WHEN group_name = '{group_b_val}' THEN n END) as n_b,
            MAX(CASE WHEN group_name = '{group_a_val}' THEN mean END) as mean_a,
            MAX(CASE WHEN group_name = '{group_b_val}' THEN mean END) as mean_b,
            MAX(CASE WHEN group_name = '{group_a_val}' THEN std END) as std_a,
            MAX(CASE WHEN group_name = '{group_b_val}' THEN std END) as std_b
        FROM group_stats
    )
    SELECT 
        n_a,
        n_b,
        mean_a,
        mean_b,
        std_a,
        std_b,
        mean_b - mean_a as mean_diff,
        (mean_b - mean_a) / mean_a * 100 as rel_change_pct,
        -- T-statistic
        (mean_b - mean_a) / SQRT((std_a*std_a/n_a) + (std_b*std_b/n_b)) as t_stat,
        -- Cohen's d
        (mean_b - mean_a) / SQRT((std_a*std_a + std_b*std_b) / 2) as cohens_d
    FROM test_stats
    """
    
    result = rs_conn.query(query)
    
    # Calculate p-value in Python (Redshift doesn't have t-distribution)
    if not result.empty:
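        # Cast explicitly: database drivers may return Decimal values
        t_stat = float(result['t_stat'].iloc[0])
        df = int(result['n_a'].iloc[0]) + int(result['n_b'].iloc[0]) - 2
        # t.sf keeps precision in the far tail, where 1 - cdf rounds to 0
        p_value = 2 * stats.t.sf(abs(t_stat), df)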
        result['p_value'] = p_value
    
    return result

# Example usage
# results = statistical_test_in_redshift(
#     rs_stats, 
#     'marketing_events', 
#     'revenue', 
#     'channel', 
#     'google', 
#     'facebook'
# )
# print(results)
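
Once the per-group count, mean, and standard deviation have been aggregated in the warehouse, the test itself needs only those six numbers. As a minimal sketch, SciPy's `ttest_ind_from_stats` can run Welch's t-test on them directly. The aggregate values below are made up for illustration and stand in for what a `GROUP BY` query might return.

```python
import numpy as np
from scipy import stats

# Hypothetical aggregates, standing in for the output of a GROUP BY query
summary = {
    "a": {"n": 2_000_000, "mean": 100.00, "std": 15.0},
    "b": {"n": 2_000_000, "mean": 100.05, "std": 15.0},
}

# Welch's t-test computed from summary statistics only (no row-level data)
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=summary["b"]["mean"], std1=summary["b"]["std"], nobs1=summary["b"]["n"],
    mean2=summary["a"]["mean"], std2=summary["a"]["std"], nobs2=summary["a"]["n"],
    equal_var=False,
)

# Cohen's d from the same aggregates
pooled_std = np.sqrt((summary["a"]["std"] ** 2 + summary["b"]["std"] ** 2) / 2)
cohens_d = (summary["b"]["mean"] - summary["a"]["mean"]) / pooled_std

print(f"t = {t_stat:.3f}, p = {p_value:.5f}, Cohen's d = {cohens_d:.4f}")
```

With two million rows per group, a difference of 0.05 is clearly significant while Cohen's d stays far below any conventional "small" effect.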

## 4. Bootstrap and Resampling at Scale <a name="bootstrap"></a>

In [None]:
class ScalableBootstrap:
    """
    Bootstrap methods optimized for large datasets
    """
    
    @staticmethod
    def bootstrap_ci(data, statistic_func, n_bootstrap=1000, ci=0.95, 
                     subsample_size=None):
        """
        Calculate bootstrap confidence interval
        
        Args:
            data: Array of data
            statistic_func: Function to calculate statistic (e.g., np.mean)
            n_bootstrap: Number of bootstrap samples
            ci: Confidence level
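            subsample_size: If set, each resample draws this many points
                instead of len(data). The interval then reflects a sample of
                that size, so it is wider than the full-data interval (by
                roughly sqrt(len(data) / subsample_size) for statistics
                such as the mean)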
        """
        
        bootstrap_stats = []
        
        # Use subsampling for very large datasets
        sample_size = subsample_size if subsample_size else len(data)
        
        for i in range(n_bootstrap):
            # Bootstrap sample
            sample = resample(data, n_samples=sample_size, replace=True)
            # Calculate statistic
            stat = statistic_func(sample)
            bootstrap_stats.append(stat)
        
        # Calculate confidence interval
        alpha = 1 - ci
        lower_percentile = (alpha / 2) * 100
        upper_percentile = (1 - alpha / 2) * 100
        
        ci_lower = np.percentile(bootstrap_stats, lower_percentile)
        ci_upper = np.percentile(bootstrap_stats, upper_percentile)
        
        return {
            'estimate': statistic_func(data),
            'ci_lower': ci_lower,
            'ci_upper': ci_upper,
            'ci_level': ci,
            'bootstrap_distribution': bootstrap_stats
        }
    
    @staticmethod
    def bootstrap_hypothesis_test(group_a, group_b, statistic_func=np.mean,
                                   n_bootstrap=10000, subsample_size=10000):
        """
        Permutation test for the difference between groups
        (labels are shuffled without replacement, so strictly speaking
        this is a permutation test rather than a bootstrap)
        
        H0: No difference between groups
        """
        
        group_a = np.asarray(group_a)
        group_b = np.asarray(group_b)
        
        # Subsample each group independently if it is too large
        if subsample_size and len(group_a) > subsample_size:
            group_a = np.random.choice(group_a, subsample_size, replace=False)
        if subsample_size and len(group_b) > subsample_size:
            group_b = np.random.choice(group_b, subsample_size, replace=False)
        
        # Observed difference, computed on the same data that builds the
        # null distribution so that both are on the same scale
        obs_diff = statistic_func(group_b) - statistic_func(group_a)
        
        # Pool data under null hypothesis
        pooled = np.concatenate([group_a, group_b])
        n_a = len(group_a)
        
        # Permutation null distribution
        null_diffs = np.empty(n_bootstrap)
        for i in range(n_bootstrap):
            # Shuffle and split
            shuffled = np.random.permutation(pooled)
            boot_a = shuffled[:n_a]
            boot_b = shuffled[n_a:]
            
            # Calculate difference
            null_diffs[i] = statistic_func(boot_b) - statistic_func(boot_a)
        
        # Calculate two-sided p-value
        p_value = np.mean(np.abs(null_diffs) >= np.abs(obs_diff))
        
        return {
            'observed_diff': obs_diff,
            'p_value': p_value,
            'null_distribution': null_diffs
        }
    
    @staticmethod
    def visualize_bootstrap_results(results, title="Bootstrap Results"):
        """
        Visualize bootstrap distribution and confidence interval
        """
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
        
        # Bootstrap distribution
        ax1.hist(results['bootstrap_distribution'], bins=50, 
                edgecolor='black', alpha=0.7)
        ax1.axvline(results['estimate'], color='red', linestyle='--', 
                   linewidth=2, label='Estimate')
        ax1.axvline(results['ci_lower'], color='green', linestyle='--', 
                   linewidth=2, label=f'{results["ci_level"]*100}% CI')
        ax1.axvline(results['ci_upper'], color='green', linestyle='--', 
                   linewidth=2)
        ax1.set_xlabel('Statistic Value')
        ax1.set_ylabel('Frequency')
        ax1.set_title('Bootstrap Distribution')
        ax1.legend()
        
        # Box plot
        ax2.boxplot(results['bootstrap_distribution'], vert=True)
        ax2.set_ylabel('Statistic Value')
        ax2.set_title('Bootstrap Distribution (Box Plot)')
        ax2.grid(True, alpha=0.3)
        
        plt.suptitle(title, fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()

# Example usage
# np.random.seed(42)
# data = np.random.lognormal(3, 1, 100000)  # Large skewed dataset
# 
# bs = ScalableBootstrap()
# results = bs.bootstrap_ci(data, np.median, n_bootstrap=1000, subsample_size=10000)
# print(f"Median: {results['estimate']:.2f}")
# print(f"95% CI: [{results['ci_lower']:.2f}, {results['ci_upper']:.2f}]")
# bs.visualize_bootstrap_results(results, "Bootstrap CI for Median")
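
When the statistic is the mean, the per-resample Python loop can be replaced by batched index arrays, which is usually much faster on large inputs. The following is a sketch under stated assumptions: `bootstrap_mean_ci` and `batch_size` are illustrative names, the interval is a plain percentile interval, and the batch size is chosen only to keep the index array small in memory.

```python
import numpy as np


def bootstrap_mean_ci(data, n_bootstrap=1000, ci=0.95, batch_size=100, seed=0):
    """Percentile bootstrap CI for the mean, resampling in vectorized batches."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n = data.size
    means = np.empty(n_bootstrap)

    for start in range(0, n_bootstrap, batch_size):
        stop = min(start + batch_size, n_bootstrap)
        # One row of indices per resample, drawn with replacement
        idx = rng.integers(0, n, size=(stop - start, n))
        means[start:stop] = data[idx].mean(axis=1)

    alpha = 1 - ci
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return data.mean(), lower, upper


rng = np.random.default_rng(42)
sample = rng.lognormal(3, 1, 20_000)   # skewed data, as in the example above

est, lo, hi = bootstrap_mean_ci(sample)
print(f"Mean: {est:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Because every resample has the same size as the input, the interval width here reflects the full sample rather than a smaller subsample.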

## 5. Power Analysis for Large Experiments <a name="power-analysis"></a>

In [None]:
class PowerAnalyzer:
    """
    Power analysis for A/B tests and experiments
    """
    
    @staticmethod
    def calculate_required_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
        """
        Calculate required sample size for proportion test
        
        Args:
            baseline_rate: Current conversion rate (e.g., 0.03)
            mde: Minimum detectable effect (relative, e.g., 0.05 for 5% lift)
            alpha: Significance level
            power: Statistical power (1 - beta)
        """
        
        # Target rate
        target_rate = baseline_rate * (1 + mde)
        
        # Effect size
        effect_size = proportion_effectsize(baseline_rate, target_rate)
        
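        # Calculate sample size (a t-test power solver is used here; at these
        # sample sizes it agrees closely with the normal approximation that a
        # two-proportion z-test implies)
        analysis = TTestIndPower()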
        n_per_group = analysis.solve_power(
            effect_size=effect_size,
            alpha=alpha,
            power=power,
            ratio=1.0,
            alternative='two-sided'
        )
        
        return {
            'n_per_group': int(np.ceil(n_per_group)),
            'total_sample_size': int(np.ceil(n_per_group * 2)),
            'baseline_rate': baseline_rate,
            'target_rate': target_rate,
            'mde': mde,
            'effect_size': effect_size,
            'alpha': alpha,
            'power': power
        }
    
    @staticmethod
    def power_curve(baseline_rate, sample_size_per_group, alpha=0.05):
        """
        Generate power curve showing detectable effect vs power
        """
        
        # Range of relative effects to test
        relative_effects = np.linspace(0.01, 0.20, 50)
        
        powers = []
        for rel_effect in relative_effects:
            target_rate = baseline_rate * (1 + rel_effect)
            effect_size = proportion_effectsize(baseline_rate, target_rate)
            
            analysis = TTestIndPower()
            power = analysis.solve_power(
                effect_size=effect_size,
                nobs1=sample_size_per_group,
                alpha=alpha,
                ratio=1.0,
                alternative='two-sided'
            )
            powers.append(power)
        
        # Plot
        fig = go.Figure()
        
        fig.add_trace(go.Scatter(
            x=relative_effects * 100,
            y=powers,
            mode='lines',
            name='Power',
            line=dict(color='blue', width=2)
        ))
        
        # Add 80% power line
        fig.add_hline(y=0.80, line_dash="dash", line_color="red",
                     annotation_text="80% Power")
        
        fig.update_layout(
            title=f'Power Curve (n={sample_size_per_group:,} per group)',
            xaxis_title='Relative Effect (%)',
            yaxis_title='Statistical Power',
            yaxis=dict(range=[0, 1]),
            height=500
        )
        
        fig.show()
        
        return pd.DataFrame({
            'relative_effect_pct': relative_effects * 100,
            'power': powers
        })
    
    @staticmethod
    def runtime_calculator(daily_traffic, n_required):
        """
        Calculate how long experiment needs to run
        """
        
        # Per group
        daily_per_group = daily_traffic / 2
        days_required = np.ceil(n_required / daily_per_group)
        
        return {
            'days_required': int(days_required),
            'weeks_required': days_required / 7,
            'daily_traffic': daily_traffic,
            'daily_per_group': daily_per_group,
            'total_required': n_required * 2
        }
    
    @staticmethod
    def sequential_testing_boundaries(n_max, alpha=0.05, n_looks=5):
        """
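        Calculate adjusted significance levels for sequential testing
        Using an O'Brien-Fleming-shaped boundary (z_k = z_final / sqrt(t_k)).
        This is an approximation: z_final is set to the fixed-sample critical
        value rather than calibrated so the overall error rate equals alpha.
        """
        
        # Information fractions for equally spaced looks (1/n_looks, ..., 1)
        info_fractions = np.linspace(1.0 / n_looks, 1.0, n_looks)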
        
        # O'Brien-Fleming boundaries
        boundaries = []
        for fraction in info_fractions:
            z_boundary = stats.norm.ppf(1 - alpha/2) / np.sqrt(fraction)
            # norm.sf avoids rounding to 0 when early boundaries are extreme
            p_boundary = 2 * stats.norm.sf(z_boundary)
            boundaries.append(p_boundary)
        
        return pd.DataFrame({
            'look_number': range(1, n_looks + 1),
            'sample_size': np.round(info_fractions * n_max).astype(int),
            'info_fraction': info_fractions,
            'p_value_boundary': boundaries
        })

# Example usage
# pa = PowerAnalyzer()
# 
# # Calculate required sample size
# result = pa.calculate_required_sample_size(
#     baseline_rate=0.03,  # 3% conversion rate
#     mde=0.10,  # Want to detect 10% lift
#     power=0.80
# )
# print(f"Required sample size per group: {result['n_per_group']:,}")
# print(f"Total sample size: {result['total_sample_size']:,}")
# 
# # Calculate runtime
# runtime = pa.runtime_calculator(daily_traffic=50000, n_required=result['n_per_group'])
# print(f"Days required: {runtime['days_required']}")
# print(f"Weeks required: {runtime['weeks_required']:.1f}")
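
As a cross-check on the solver-based result, the required sample size for a two-sided test of two proportions has a closed form under the normal approximation: n per group = ((z_{1-α/2} + z_{power}) / h)², where h is Cohen's h. The sketch below implements that formula; `n_per_group_normal_approx` is an illustrative helper, and equal group sizes are assumed.

```python
import numpy as np
from scipy.stats import norm


def n_per_group_normal_approx(baseline_rate, mde, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion test.

    Uses the normal approximation with Cohen's h as the effect size
    and assumes equal group sizes. `mde` is a relative lift.
    """
    target_rate = baseline_rate * (1 + mde)
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * (np.arcsin(np.sqrt(target_rate)) - np.arcsin(np.sqrt(baseline_rate)))
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return int(np.ceil(((z_alpha + z_power) / h) ** 2))


n_10pct = n_per_group_normal_approx(baseline_rate=0.03, mde=0.10)
n_5pct = n_per_group_normal_approx(baseline_rate=0.03, mde=0.05)
print(f"10% relative lift: {n_10pct:,} per group")
print(f" 5% relative lift: {n_5pct:,} per group")
```

Halving the detectable lift roughly quadruples the required sample, because n scales with 1/h².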

## 6. Bayesian Statistics for Marketing <a name="bayesian"></a>

In [None]:
class BayesianMarketingAnalysis:
    """
    Bayesian methods for marketing analytics
    Better for continuous monitoring and incorporating prior knowledge
    """
    
    @staticmethod
    def bayesian_ab_test(conversions_a, visitors_a, conversions_b, visitors_b,
                         prior_alpha=1, prior_beta=1, n_samples=10000):
        """
        Bayesian A/B test for conversion rates
        Uses Beta-Binomial conjugate prior
        """
        
        # Posterior distributions
        posterior_a = beta(prior_alpha + conversions_a, 
                          prior_beta + visitors_a - conversions_a)
        posterior_b = beta(prior_alpha + conversions_b, 
                          prior_beta + visitors_b - conversions_b)
        
        # Sample from posteriors
        samples_a = posterior_a.rvs(n_samples)
        samples_b = posterior_b.rvs(n_samples)
        
        # Probability B > A
        prob_b_better = np.mean(samples_b > samples_a)
        
        # Expected lift
        lift_samples = (samples_b - samples_a) / samples_a
        expected_lift = np.mean(lift_samples)
        lift_ci = np.percentile(lift_samples, [2.5, 97.5])
        
        # Expected loss (risk of choosing wrong variant)
        loss_a = np.mean(np.maximum(samples_b - samples_a, 0))
        loss_b = np.mean(np.maximum(samples_a - samples_b, 0))
        
        return {
            'prob_b_better': prob_b_better,
            'prob_a_better': 1 - prob_b_better,
            'expected_lift': expected_lift,
            'lift_ci_95': lift_ci,
            'expected_loss_a': loss_a,
            'expected_loss_b': loss_b,
            'samples_a': samples_a,
            'samples_b': samples_b,
            'posterior_a_mean': posterior_a.mean(),
            'posterior_b_mean': posterior_b.mean()
        }
    
    @staticmethod
    def visualize_bayesian_test(results):
        """
        Visualize Bayesian A/B test results
        """
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # 1. Posterior distributions
        axes[0, 0].hist(results['samples_a'], bins=50, alpha=0.5, 
                       label='Variant A', density=True)
        axes[0, 0].hist(results['samples_b'], bins=50, alpha=0.5, 
                       label='Variant B', density=True)
        axes[0, 0].set_xlabel('Conversion Rate')
        axes[0, 0].set_ylabel('Density')
        axes[0, 0].set_title('Posterior Distributions')
        axes[0, 0].legend()
        
        # 2. Lift distribution
        lift = (results['samples_b'] - results['samples_a']) / results['samples_a'] * 100
        axes[0, 1].hist(lift, bins=50, edgecolor='black', alpha=0.7)
        axes[0, 1].axvline(0, color='red', linestyle='--', linewidth=2)
        axes[0, 1].axvline(np.percentile(lift, 2.5), color='green', 
                          linestyle='--', label='95% CI')
        axes[0, 1].axvline(np.percentile(lift, 97.5), color='green', 
                          linestyle='--')
        axes[0, 1].set_xlabel('Lift (%)')
        axes[0, 1].set_ylabel('Frequency')
        axes[0, 1].set_title('Distribution of Lift')
        axes[0, 1].legend()
        
        # 3. Probability B beats A
        categories = ['A Better', 'B Better']
        probs = [results['prob_a_better'], results['prob_b_better']]
        colors = ['#ff6b6b' if p < 0.5 else '#51cf66' for p in probs]
        axes[1, 0].bar(categories, probs, color=colors, edgecolor='black')
        axes[1, 0].set_ylabel('Probability')
        axes[1, 0].set_title('Probability of Being Better')
        axes[1, 0].set_ylim([0, 1])
        for i, v in enumerate(probs):
            axes[1, 0].text(i, v + 0.02, f'{v:.1%}', 
                           ha='center', fontweight='bold')
        
        # 4. Expected loss
        losses = [results['expected_loss_a'], results['expected_loss_b']]
        axes[1, 1].bar(['Choose A', 'Choose B'], losses, 
                      color=['#ff6b6b', '#51cf66'], edgecolor='black')
        axes[1, 1].set_ylabel('Expected Loss')
        axes[1, 1].set_title('Expected Loss (Risk)')
        for i, v in enumerate(losses):
            axes[1, 1].text(i, v + max(losses)*0.02, f'{v:.4f}', 
                           ha='center', fontweight='bold')
        
        plt.tight_layout()
        plt.show()
    
    @staticmethod
    def bayesian_revenue_test(revenue_a, revenue_b, n_samples=10000):
        """
        Bayesian test for revenue (continuous metric)
        Uses PyMC3 for more complex modeling
        """
        
        with pm.Model() as model:
            # Priors
            mu_a = pm.Normal('mu_a', mu=np.mean(revenue_a), sd=np.std(revenue_a))
            mu_b = pm.Normal('mu_b', mu=np.mean(revenue_b), sd=np.std(revenue_b))
            
            sigma_a = pm.HalfNormal('sigma_a', sd=np.std(revenue_a))
            sigma_b = pm.HalfNormal('sigma_b', sd=np.std(revenue_b))
            
            # Likelihood
            obs_a = pm.Normal('obs_a', mu=mu_a, sd=sigma_a, observed=revenue_a)
            obs_b = pm.Normal('obs_b', mu=mu_b, sd=sigma_b, observed=revenue_b)
            
            # Difference
            diff = pm.Deterministic('diff', mu_b - mu_a)
            lift = pm.Deterministic('lift', (mu_b - mu_a) / mu_a)
            
            # Sample
            trace = pm.sample(n_samples, return_inferencedata=True, 
                            progressbar=False)
        
        # Extract results
        diff_samples = trace.posterior['diff'].values.flatten()
        lift_samples = trace.posterior['lift'].values.flatten()
        
        return {
            'prob_b_better': np.mean(diff_samples > 0),
            'expected_diff': np.mean(diff_samples),
            'diff_ci_95': np.percentile(diff_samples, [2.5, 97.5]),
            'expected_lift': np.mean(lift_samples),
            'lift_ci_95': np.percentile(lift_samples, [2.5, 97.5]),
            'trace': trace
        }

# Example usage
# bayes = BayesianMarketingAnalysis()
# 
# # Simulate A/B test data
# conversions_a, visitors_a = 250, 10000
# conversions_b, visitors_b = 275, 10000
# 
# results = bayes.bayesian_ab_test(conversions_a, visitors_a, 
#                                  conversions_b, visitors_b)
# 
# print(f"Probability B is better: {results['prob_b_better']:.1%}")
# print(f"Expected lift: {results['expected_lift']:.2%}")
# print(f"95% CI: [{results['lift_ci_95'][0]:.2%}, {results['lift_ci_95'][1]:.2%}]")
# 
# bayes.visualize_bayesian_test(results)
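
The Beta-Binomial conjugate update used above is simple enough to verify by hand: a Beta(α, β) prior combined with c conversions out of n visitors gives a Beta(α + c, β + n − c) posterior. The sketch below applies this to the example counts and reads credible intervals straight from the posterior, without sampling. `beta_posterior` is an illustrative helper, and the uniform Beta(1, 1) prior is an assumption carried over from the defaults above.

```python
from scipy.stats import beta


def beta_posterior(conversions, visitors, prior_alpha=1.0, prior_beta=1.0):
    """Posterior over a conversion rate under a Beta prior and binomial data."""
    return beta(prior_alpha + conversions, prior_beta + visitors - conversions)


post_a = beta_posterior(250, 10_000)
post_b = beta_posterior(275, 10_000)

for name, post in [("A", post_a), ("B", post_b)]:
    # Equal-tailed 95% credible interval from the posterior quantiles
    low, high = post.ppf([0.025, 0.975])
    print(f"Variant {name}: mean={post.mean():.4%}, 95% CrI=[{low:.4%}, {high:.4%}]")
```

With a uniform prior the posterior mean is (c + 1) / (n + 2), so at this sample size the prior has almost no influence on the estimate.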

## 7. Time Series Analysis at Scale <a name="timeseries"></a>

In [None]:
class TimeSeriesAnalyzer:
    """
    Time series analysis for marketing metrics
    """
    
    @staticmethod
    def decompose_series(df, date_col, value_col, freq=7):
        """
        Decompose time series into trend, seasonal, and residual
        
        Args:
            df: DataFrame with time series data
            date_col: Date column name
            value_col: Value column name
            freq: Seasonal frequency (7 for weekly, 30 for monthly)
        """
        
        # Ensure sorted by date
        df_sorted = df.sort_values(date_col)
        
        # Decomposition
        decomposition = seasonal_decompose(
            df_sorted[value_col],
            model='additive',
            period=freq
        )
        
        # Plot
        fig, axes = plt.subplots(4, 1, figsize=(14, 10))
        
        # Original
        axes[0].plot(df_sorted[date_col], df_sorted[value_col])
        axes[0].set_ylabel('Original')
        axes[0].set_title('Time Series Decomposition')
        
        # Trend
        axes[1].plot(df_sorted[date_col], decomposition.trend)
        axes[1].set_ylabel('Trend')
        
        # Seasonal
        axes[2].plot(df_sorted[date_col], decomposition.seasonal)
        axes[2].set_ylabel('Seasonal')
        
        # Residual
        axes[3].plot(df_sorted[date_col], decomposition.resid)
        axes[3].set_ylabel('Residual')
        axes[3].set_xlabel('Date')
        
        plt.tight_layout()
        plt.show()
        
        return decomposition
    
    @staticmethod
    def detect_anomalies(df, date_col, value_col, threshold=3):
        """
        Detect anomalies with a rolling z-score: flag points that lie more
        than `threshold` rolling standard deviations from the rolling mean
        """
        
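        # Calculate rolling statistics from the 7 preceding points only.
        # If the window includes the point being scored, |z| is capped at
        # (n-1)/sqrt(n), about 2.27 for n=7, so threshold=3 could never fire.
        df = df.sort_values(date_col)  # returns a sorted copy
        baseline = df[value_col].shift(1).rolling(window=7)
        df['rolling_mean'] = baseline.mean()
        df['rolling_std'] = baseline.std()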
        
        # Z-score
        df['z_score'] = (df[value_col] - df['rolling_mean']) / df['rolling_std']
        
        # Flag anomalies
        df['is_anomaly'] = abs(df['z_score']) > threshold
        
        # Visualize
        fig = go.Figure()
        
        # Normal points
        fig.add_trace(go.Scatter(
            x=df[~df['is_anomaly']][date_col],
            y=df[~df['is_anomaly']][value_col],
            mode='lines+markers',
            name='Normal',
            line=dict(color='blue')
        ))
        
        # Anomalies
        fig.add_trace(go.Scatter(
            x=df[df['is_anomaly']][date_col],
            y=df[df['is_anomaly']][value_col],
            mode='markers',
            name='Anomaly',
            marker=dict(color='red', size=10, symbol='x')
        ))
        
        # Threshold bands (rolling mean +/- threshold * rolling std)
        fig.add_trace(go.Scatter(
            x=df[date_col],
            y=df['rolling_mean'] + threshold * df['rolling_std'],
            mode='lines',
            name='Upper bound',
            line=dict(color='gray', dash='dash')
        ))
        
        fig.add_trace(go.Scatter(
            x=df[date_col],
            y=df['rolling_mean'] - threshold * df['rolling_std'],
            mode='lines',
            name='Lower bound',
            line=dict(color='gray', dash='dash')
        ))
        
        fig.update_layout(
            title='Anomaly Detection',
            xaxis_title='Date',
            yaxis_title='Value',
            height=500
        )
        
        fig.show()
        
        return df[df['is_anomaly']]
    
    @staticmethod
    def forecast_with_prophet(df, date_col, value_col, periods=30):
        """
        Forecast using Facebook Prophet
        Good for marketing data with trends and seasonality
        """
        
        # Prepare data for Prophet
        df_prophet = df[[date_col, value_col]].copy()
        df_prophet.columns = ['ds', 'y']
        
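        # Fit model
        # Note: yearly_seasonality=True forces a yearly component even with
        # less than two years of history, where it is poorly identified;
        # yearly_seasonality='auto' lets Prophet decide from the data span.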
        model = Prophet(
            yearly_seasonality=True,
            weekly_seasonality=True,
            daily_seasonality=False
        )
        model.fit(df_prophet)
        
        # Make future dataframe
        future = model.make_future_dataframe(periods=periods)
        
        # Predict
        forecast = model.predict(future)
        
        # Plot
        fig = model.plot(forecast)
        plt.title(f'{value_col} Forecast')
        plt.tight_layout()
        plt.show()
        
        # Plot components
        fig = model.plot_components(forecast)
        plt.tight_layout()
        plt.show()
        
        return forecast

# Example usage
# # Generate sample time series
# dates = pd.date_range('2023-01-01', periods=365, freq='D')
# trend = np.linspace(100, 150, 365)
# seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 7)
# noise = np.random.normal(0, 5, 365)
# values = trend + seasonal + noise
# 
# df_ts = pd.DataFrame({'date': dates, 'revenue': values})
# 
# tsa = TimeSeriesAnalyzer()
# decomp = tsa.decompose_series(df_ts, 'date', 'revenue', freq=7)
# anomalies = tsa.detect_anomalies(df_ts, 'date', 'revenue', threshold=3)
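
When choosing the window and threshold for rolling z-score detection, it helps to know that a point scored against a window that contains it can never exceed |z| = (n-1)/sqrt(n), where n is the window size and the sample standard deviation is used. The sketch below illustrates this on a synthetic series like the one in the example usage, with one injected spike. The spike position and size are hypothetical, and the trailing baseline (statistics taken from the preceding points only) is one possible design, not the only one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_days, window, threshold = 365, 7, 3

# Synthetic daily revenue: trend + weekly seasonality + noise
dates = pd.date_range('2023-01-01', periods=n_days, freq='D')
values = (np.linspace(100, 150, n_days)
          + 10 * np.sin(2 * np.pi * np.arange(n_days) / 7)
          + rng.normal(0, 5, n_days))
values[200] += 80  # injected spike (hypothetical anomaly)
series = pd.Series(values, index=dates)

# Centered window: the scored point is part of its own baseline
centered = series.rolling(window, center=True)
z_centered = (series - centered.mean()) / centered.std()

# Trailing baseline: statistics from the preceding `window` points only
trailing = series.shift(1).rolling(window)
z_trailing = (series - trailing.mean()) / trailing.std()

bound = (window - 1) / np.sqrt(window)
print(f"max |z| with centered window: {z_centered.abs().max():.2f} "
      f"(theoretical bound {bound:.2f})")
print(f"spike flagged by trailing baseline: "
      f"{abs(z_trailing.iloc[200]) > threshold}")
```

With a 7-point window the bound is about 2.27, so a threshold of 3 needs either a longer window or a baseline that excludes the point under test.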

## 8. Real-World Project: 50M Conversion Analysis <a name="project"></a>

In [None]:
"""
PROJECT: Statistical Analysis of 50M Conversions

Scenario:
You have a Redshift table with 50M conversion events from the past year.
Perform comprehensive statistical analysis to answer:

1. Which channels have significantly different conversion rates?
2. What is the confidence interval for revenue per conversion?
3. Are there seasonal patterns in conversion rates?
4. Which variants in ongoing A/B tests should we choose?
5. Can we forecast next month's conversions?

Requirements:
- All analysis must handle 50M+ rows efficiently
- Use appropriate statistical tests
- Calculate effect sizes and confidence intervals
- Create visualizations
- Provide actionable recommendations
"""

class ConversionAnalysisProject:
    """
    Complete statistical analysis framework for large conversion dataset
    """
    
    def __init__(self, rs_conn, table_name):
        self.rs_conn = rs_conn
        self.table = table_name
        self.results = {}
    
    def run_complete_analysis(self):
        """Execute full analysis pipeline"""
        print("Starting 50M Conversion Analysis...\n")
        
        # 1. Channel comparison
        print("[1/5] Analyzing channel differences...")
        self.results['channels'] = self.analyze_channels()
        
        # 2. Revenue analysis
        print("[2/5] Analyzing revenue distribution...")
        self.results['revenue'] = self.analyze_revenue()
        
        # 3. Time series analysis
        print("[3/5] Analyzing temporal patterns...")
        self.results['timeseries'] = self.analyze_timeseries()
        
        # 4. A/B test analysis
        print("[4/5] Analyzing A/B tests...")
        self.results['ab_tests'] = self.analyze_ab_tests()
        
        # 5. Forecasting
        print("[5/5] Generating forecasts...")
        self.results['forecast'] = self.generate_forecast()
        
        print("\n✓ Analysis complete!")
        return self.results
    
    def analyze_channels(self):
        """
        Compare conversion rates across channels
        """
        # Get channel statistics from Redshift
        query = f"""
        SELECT 
            channel,
            COUNT(*) as total_events,
            SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) as conversions,
            AVG(CASE WHEN converted = 1 THEN 1.0 ELSE 0.0 END) as conversion_rate,
            STDDEV(CASE WHEN converted = 1 THEN 1.0 ELSE 0.0 END) as std_dev
        FROM {self.table}
        GROUP BY channel
        ORDER BY total_events DESC
        """
        
        df = self.rs_conn.query(query)
        
        # Pairwise comparisons
        tester = LargeSampleTesting()
        comparisons = []
        
        for i in range(len(df)):
            for j in range(i+1, len(df)):
                channel_a = df.iloc[i]
                channel_b = df.iloc[j]
                
                result = tester.proportion_test(
                    channel_a['conversions'], channel_a['total_events'],
                    channel_b['conversions'], channel_b['total_events'],
                    f"{channel_a['channel']} vs {channel_b['channel']}"
                )
                comparisons.append(result)
        
        return {
            'channel_stats': df,
            'comparisons': comparisons
        }
    
    def analyze_revenue(self):
        """
        Analyze revenue distribution with bootstrap CI
        """
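        # Sample revenue data in the database (can't load 50M into memory).
        # Note: ORDER BY RANDOM() sorts every qualifying row; a Bernoulli
        # filter such as WHERE RANDOM() < p is cheaper on very large tables.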
        query = f"""
        SELECT revenue
        FROM {self.table}
        WHERE converted = 1
            AND revenue > 0
        ORDER BY RANDOM()
        LIMIT 100000
        """
        
        df = self.rs_conn.query(query)
        revenue_data = df['revenue'].values
        
        # Bootstrap CI for median
        bs = ScalableBootstrap()
        ci_result = bs.bootstrap_ci(
            revenue_data,
            np.median,
            n_bootstrap=1000
        )
        
        return ci_result
    
    def analyze_timeseries(self):
        """
        Analyze temporal patterns
        """
        # Get daily conversion rates
        query = f"""
        SELECT 
            DATE(timestamp) as date,
            COUNT(*) as total,
            SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) as conversions,
            AVG(CASE WHEN converted = 1 THEN 1.0 ELSE 0.0 END) as conversion_rate
        FROM {self.table}
        GROUP BY DATE(timestamp)
        ORDER BY date
        """
        
        df = self.rs_conn.query(query)
        
        # Decompose
        tsa = TimeSeriesAnalyzer()
        decomp = tsa.decompose_series(df, 'date', 'conversion_rate', freq=7)
        
        return {
            'daily_data': df,
            'decomposition': decomp
        }
    
    def analyze_ab_tests(self):
        """
        Analyze active A/B tests
        """
        # Assume we have variant column
        query = f"""
        SELECT 
            variant,
            COUNT(*) as visitors,
            SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) as conversions
        FROM {self.table}
        WHERE variant IS NOT NULL
        GROUP BY variant
        ORDER BY variant
        """
        
        df = self.rs_conn.query(query)
        
        if len(df) >= 2:
            # Bayesian analysis
            bayes = BayesianMarketingAnalysis()
            result = bayes.bayesian_ab_test(
                df.iloc[0]['conversions'], df.iloc[0]['visitors'],
                df.iloc[1]['conversions'], df.iloc[1]['visitors']
            )
            return result
        
        return None
    
    def generate_forecast(self):
        """
        Forecast next 30 days of conversions
        """
        # Get historical daily conversions
        query = f"""
        SELECT 
            DATE(timestamp) as date,
            SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) as conversions
        FROM {self.table}
        GROUP BY DATE(timestamp)
        ORDER BY date
        """
        
        df = self.rs_conn.query(query)
        
        # Forecast with Prophet
        tsa = TimeSeriesAnalyzer()
        forecast = tsa.forecast_with_prophet(df, 'date', 'conversions', periods=30)
        
        return forecast
    
    def generate_report(self):
        """
        Generate executive report
        """
        # Implementation left as exercise for students
        pass

# Example usage
# project = ConversionAnalysisProject(rs_stats, 'conversion_events')
# results = project.run_complete_analysis()
# report = project.generate_report()
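
The pairwise channel comparisons in `analyze_channels` run one test per pair, so the chance of at least one false positive grows with the number of channels. A minimal sketch of a Bonferroni adjustment follows, using only the standard library. The channel totals are made up for illustration, and `two_proportion_z` is a hypothetical helper written for this sketch, not the `LargeSampleTesting.proportion_test` method used above. It also reports Cohen's h so that statistical and practical significance can be read side by side.

```python
import math
from itertools import combinations

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in proportions, plus Cohen's h."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    h = 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))
    return z, p_value, h

# Hypothetical channel totals: (conversions, total events)
channels = {
    'email': (26_000, 1_000_000),
    'search': (25_000, 1_000_000),
    'social': (25_100, 1_000_000),
}

pairs = list(combinations(channels, 2))
alpha = 0.05
alpha_adj = alpha / len(pairs)  # Bonferroni: split alpha across all tests

rows = []
for a, b in pairs:
    z, p, h = two_proportion_z(*channels[a], *channels[b])
    significant = p < alpha_adj
    rows.append((a, b, p, h, significant))
    print(f"{a} vs {b}: p={p:.2e}, Cohen's h={h:+.4f}, "
          f"significant at adjusted alpha={alpha_adj:.4f}: {significant}")
```

In this made-up data, differences of about 0.1 percentage points are significant after correction at one million events per channel, yet |h| stays far below the conventional 0.2 "small effect" mark, which is the large-sample pattern this notebook warns about.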

## 9. Exercises <a name="exercises"></a>

### Exercise 1: Channel Performance Analysis

**Task:** Using a large marketing dataset:
1. Compare conversion rates across all channels
2. Calculate effect sizes for all pairwise comparisons
3. Adjust for multiple comparisons (Bonferroni correction)
4. Visualize results
5. Provide recommendations

In [None]:
# Your solution here


### Exercise 2: Sample Size Planning

**Task:** Plan an A/B test:
1. Current conversion rate: 2.5%
2. Want to detect 5% relative lift
3. Calculate required sample size
4. Given 100k daily visitors, how long to run?
5. Create power curve
6. Set up sequential testing boundaries

In [None]:
# Your solution here


### Exercise 3: Bayesian vs Frequentist Comparison

**Task:** Analyze the same A/B test both ways:
1. Perform frequentist test (two-proportion z-test; a t-test gives nearly identical results at large sample sizes)
2. Perform Bayesian test
3. Compare conclusions
4. Discuss pros/cons of each approach
5. When would you use each?

In [None]:
# Your solution here


### Exercise 4: Time Series Anomaly Detection

**Task:** Build anomaly detection system:
1. Load daily revenue data
2. Implement multiple anomaly detection methods
3. Compare methods
4. Create automated alerting logic
5. Test on historical data with known anomalies

In [None]:
# Your solution here


## Summary

In this notebook, you learned:

1. **Large Sample Testing**
   - Why p-values aren't enough
   - Importance of effect sizes
   - Practical vs statistical significance

2. **Efficient Analysis Methods**
   - In-database statistical tests
   - Smart sampling strategies
   - Bootstrap at scale

3. **Experimental Design**
   - Power analysis
   - Sample size calculation
   - Sequential testing

4. **Bayesian Methods**
   - Bayesian A/B testing
   - Continuous monitoring
   - Incorporating priors

5. **Time Series Analysis**
   - Decomposition
   - Anomaly detection
   - Forecasting

## Next Steps

- Week 8: Advanced A/B Testing Platform
- Practice with real datasets
- Build automated analysis pipelines
- Implement production systems

## Additional Resources

- [Evan Miller's A/B Tools](https://www.evanmiller.org/ab-testing/)
- [Trustworthy Online Controlled Experiments](https://www.amazon.com/Trustworthy-Online-Controlled-Experiments-Practical/dp/1108724264)
- [Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers)
- [Prophet Documentation](https://facebook.github.io/prophet/)