# Week 5: Advanced EDA at Scale

## Learning Objectives
- Perform EDA on 10M+ row datasets efficiently
- Use Redshift SQL for statistical analysis at scale
- Implement distributed computing with Dask for EDA
- Apply advanced outlier detection techniques at scale
- Execute correlation analysis on massive tables
- Build automated data profiling pipelines
- Design and implement effective sampling strategies
- Profile 100M row marketing databases

## Prerequisites
```bash
pip install pandas numpy scipy statsmodels "dask[complete]"
pip install psycopg2-binary sqlalchemy redshift_connector
pip install seaborn matplotlib plotly
pip install ydata-profiling great-expectations  # ydata-profiling is the current name of pandas-profiling
pip install pyarrow fastparquet scikit-learn
```

## 1. Setup and Environment

In [None]:
# Standard library
import os
import sys
import logging
import time
import warnings
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional
import gc

# Data manipulation
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client, progress

# Statistics
from scipy import stats
from scipy.stats import norm, chi2, kstest, anderson
from statsmodels import robust
from statsmodels.stats import weightstats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

# Database
import psycopg2
from sqlalchemy import create_engine, text

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Configuration
warnings.filterwarnings('ignore')
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.4f}'.format)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print(f"Pandas: {pd.__version__}")
print(f"Dask: {dask.__version__}")
print(f"NumPy: {np.__version__}")
print("Environment ready!")

## 2. Redshift Connection for Large-Scale EDA

In [None]:
class RedshiftEDA:
    """Redshift connection manager optimized for EDA queries."""
    
    def __init__(self, config: Dict[str, str]):
        self.config = config
        self.engine = None
        self.logger = logging.getLogger(self.__class__.__name__)
        self._create_engine()
    
    def _create_engine(self):
        """Create SQLAlchemy engine with optimized settings."""
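        # Build the URL with URL.create so special characters in the password are escaped
        from sqlalchemy.engine import URL

        conn_str = URL.create(
            drivername="postgresql+psycopg2",
            username=self.config['user'],
            password=self.config['password'],
            host=self.config['host'],
            port=int(self.config.get('port', 5439)),
            database=self.config['database'],
        )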
        
        self.engine = create_engine(
            conn_str,
            pool_size=5,
            max_overflow=10,
            pool_pre_ping=True,
            pool_recycle=3600
        )
        self.logger.info("Redshift engine created")
    
    def get_table_profile(self, table: str, sample_size: int = 100000) -> Dict:
        """Get comprehensive profile of a table."""
        self.logger.info(f"Profiling table: {table}")
        
        profile = {}
        
        # Basic stats
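        # COUNT(DISTINCT *) is not valid SQL, so distinct rows are counted via a subquery
        basic_query = f"""
        SELECT
            (SELECT COUNT(*) FROM {table}) AS row_count,
            (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {table}) AS d) AS unique_rows
        """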
        profile['basic'] = pd.read_sql(basic_query, self.engine).to_dict('records')[0]
        
        # Column information
        columns_query = f"""
        SELECT 
            column_name,
            data_type,
            is_nullable
        FROM information_schema.columns
        WHERE table_name = '{table.split('.')[-1]}'
        ORDER BY ordinal_position
        """
        profile['columns'] = pd.read_sql(columns_query, self.engine)
        
        # Sample data (the first rows returned by LIMIT, not a random sample)
        sample_query = f"SELECT * FROM {table} LIMIT {sample_size}"
        profile['sample'] = pd.read_sql(sample_query, self.engine)
        
        return profile
    
    def compute_statistics(self, table: str, column: str) -> pd.DataFrame:
        """Compute comprehensive statistics for a column."""
        query = f"""
        SELECT 
            COUNT(*) as count,
            COUNT(DISTINCT {column}) as unique_count,
            COUNT(*) - COUNT({column}) as null_count,
            MIN({column}) as min_value,
            MAX({column}) as max_value,
            AVG({column}::FLOAT) as mean,
            STDDEV({column}::FLOAT) as stddev,
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY {column}) as q25,
            PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY {column}) as median,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY {column}) as q75,
            PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY {column}) as p90,
            PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY {column}) as p95,
            PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY {column}) as p99
        -- No NULL filter here: aggregates already skip NULLs, and null_count needs the NULL rows
        FROM {table}
        """
        
        return pd.read_sql(query, self.engine)
    
    def detect_outliers_sql(self, table: str, column: str, method: str = 'iqr') -> pd.DataFrame:
        """Detect outliers using SQL."""
        if method == 'iqr':
            query = f"""
            WITH stats AS (
                SELECT 
                    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY {column}) as q1,
                    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY {column}) as q3
                FROM {table}
            ),
            bounds AS (
                SELECT 
                    q1 - 1.5 * (q3 - q1) as lower_bound,
                    q3 + 1.5 * (q3 - q1) as upper_bound
                FROM stats
            )
            SELECT 
                *,
                CASE 
                    WHEN {column} < (SELECT lower_bound FROM bounds) THEN 'low_outlier'
                    WHEN {column} > (SELECT upper_bound FROM bounds) THEN 'high_outlier'
                    ELSE 'normal'
                END as outlier_flag
            FROM {table}
            WHERE {column} IS NOT NULL
            """
        elif method == 'zscore':
            query = f"""
            WITH stats AS (
                SELECT 
                    AVG({column}::FLOAT) as mean,
                    STDDEV({column}::FLOAT) as stddev
                FROM {table}
            )
            SELECT 
                *,
                ABS(({column} - (SELECT mean FROM stats)) / (SELECT stddev FROM stats)) as z_score,
                CASE 
                    WHEN ABS(({column} - (SELECT mean FROM stats)) / (SELECT stddev FROM stats)) > 3 THEN 'outlier'
                    ELSE 'normal'
                END as outlier_flag
            FROM {table}
            WHERE {column} IS NOT NULL
            """
        
        else:
            raise ValueError(f"Unknown method: {method!r}; expected 'iqr' or 'zscore'")

        return pd.read_sql(query, self.engine)
    
    def correlation_matrix(self, table: str, columns: List[str], sample_size: int = 1000000) -> pd.DataFrame:
        """Compute correlation matrix for large table."""
        cols_str = ', '.join(columns)
        # Cast to FLOAT: integer division would truncate the sampling fraction to 0
        query = f"SELECT {cols_str} FROM {table} WHERE RANDOM() < {sample_size}::FLOAT / (SELECT COUNT(*) FROM {table})"
        
        df = pd.read_sql(query, self.engine)
        return df.corr()


# Configuration
REDSHIFT_CONFIG = {
    'host': os.getenv('REDSHIFT_HOST', 'your-cluster.region.redshift.amazonaws.com'),
    'port': int(os.getenv('REDSHIFT_PORT', 5439)),
    'database': os.getenv('REDSHIFT_DB', 'marketing'),
    'user': os.getenv('REDSHIFT_USER', 'analyst'),
    'password': os.getenv('REDSHIFT_PASSWORD', 'password')
}

# rs_eda = RedshiftEDA(REDSHIFT_CONFIG)
print("Redshift EDA manager configured")
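
Sampling predicates of the form `RANDOM() < sample_size / COUNT(*)` are easy to get wrong, because dividing two integers in SQL truncates the fraction to 0. The sketch below builds such a query with an explicit `FLOAT` cast and a basic identifier check. The helper `build_sample_query` and the table name `marketing.events` are hypothetical and not part of `RedshiftEDA`.

```python
import re
from typing import List

# Accepts plain or schema-qualified identifiers such as "events" or "marketing.events"
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_sample_query(table: str, columns: List[str], sample_size: int) -> str:
    """Build a row-level random sampling query (hypothetical helper)."""
    for name in [table, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")

    cols = ", ".join(columns)
    # The FLOAT cast keeps the ratio fractional instead of truncating it to 0
    return (
        f"SELECT {cols} FROM {table} "
        f"WHERE RANDOM() < {sample_size}::FLOAT / (SELECT COUNT(*) FROM {table})"
    )


query = build_sample_query("marketing.events", ["revenue", "clicks"], 1_000_000)
print(query)
```

Because table and column names are interpolated into the SQL text, validating them against a strict pattern is a cheap guard when the names come from user input.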

## 3. EDA on 10M+ Row Datasets

In [None]:
class LargeScaleEDA:
    """Perform EDA on massive datasets efficiently."""
    
    def __init__(self, chunk_size: int = 100000):
        self.chunk_size = chunk_size
        self.logger = logging.getLogger(self.__class__.__name__)
    
    def chunked_statistics(self, file_path: str) -> pd.DataFrame:
        """Calculate statistics using chunked processing."""
        self.logger.info(f"Computing statistics for {file_path}")
        
        stats_list = []
        total_rows = 0
        
        for chunk_num, chunk in enumerate(pd.read_csv(file_path, chunksize=self.chunk_size), 1):
            chunk_stats = {
                'chunk': chunk_num,
                'rows': len(chunk),
                'numeric_cols': chunk.select_dtypes(include=[np.number]).columns.tolist(),
                'mean': chunk.select_dtypes(include=[np.number]).mean().to_dict(),
                'std': chunk.select_dtypes(include=[np.number]).std().to_dict(),
                'min': chunk.select_dtypes(include=[np.number]).min().to_dict(),
                'max': chunk.select_dtypes(include=[np.number]).max().to_dict()
            }
            stats_list.append(chunk_stats)
            total_rows += len(chunk)
            
            if chunk_num % 10 == 0:
                self.logger.info(f"Processed {total_rows:,} rows")
                gc.collect()
        
        return pd.DataFrame(stats_list)
    
    def incremental_histogram(self, file_path: str, column: str, bins: int = 50) -> Tuple:
        """Build histogram incrementally from large file."""
        self.logger.info(f"Building histogram for {column}")
        
        # First pass: find min/max
        min_val = float('inf')
        max_val = float('-inf')
        
        for chunk in pd.read_csv(file_path, chunksize=self.chunk_size):
            if column in chunk.columns:
                min_val = min(min_val, chunk[column].min())
                max_val = max(max_val, chunk[column].max())
        
        # Create bins
        bin_edges = np.linspace(min_val, max_val, bins + 1)
        hist = np.zeros(bins)
        
        # Second pass: build histogram
        for chunk in pd.read_csv(file_path, chunksize=self.chunk_size):
            if column in chunk.columns:
                chunk_hist, _ = np.histogram(chunk[column].dropna(), bins=bin_edges)
                hist += chunk_hist
        
        return hist, bin_edges
    
    def memory_efficient_describe(self, file_path: str) -> pd.DataFrame:
        """Generate describe() output for large files."""
        self.logger.info("Generating descriptive statistics")
        
        # Online algorithm for statistics; counts are tracked per column so NaNs are skipped.
        # Assumes every numeric column has at least one non-null value in the first chunk.
        n = None
        mean = None
        m2 = None
        min_vals = None
        max_vals = None

        for chunk in pd.read_csv(file_path, chunksize=self.chunk_size):
            numeric_chunk = chunk.select_dtypes(include=[np.number])
            chunk_n = numeric_chunk.count()

            if mean is None:
                n = chunk_n
                mean = numeric_chunk.mean()
                m2 = ((numeric_chunk - mean) ** 2).sum()
                min_vals = numeric_chunk.min()
                max_vals = numeric_chunk.max()
            else:
                # Batched Welford update: M2 += sum((x - old_mean) * (x - new_mean))
                delta = numeric_chunk - mean
                n = n + chunk_n
                mean = mean + delta.sum() / n
                m2 = m2 + (delta * (numeric_chunk - mean)).sum()
                min_vals = pd.concat([min_vals, numeric_chunk.min()], axis=1).min(axis=1)
                max_vals = pd.concat([max_vals, numeric_chunk.max()], axis=1).max(axis=1)

        # Sample variance (ddof=1), matching pandas' describe()
        variance = m2 / (n - 1)
        std = np.sqrt(variance)
        
        return pd.DataFrame({
            'count': n,
            'mean': mean,
            'std': std,
            'min': min_vals,
            'max': max_vals
        })


# Example usage
eda = LargeScaleEDA(chunk_size=100000)
print("Large-scale EDA tools ready")
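
Chunked statistics depend on being able to combine per-chunk summaries exactly. The following dependency-free sketch shows the standard pairwise merge of `(count, mean, M2)` summaries, where `M2` is the sum of squared deviations from the mean. The function names and the sample values are illustrative, and each chunk is assumed to be non-empty.

```python
import statistics
from typing import Iterable, Tuple

Summary = Tuple[int, float, float]  # (count, mean, M2 = sum of squared deviations)


def summarize(values: Iterable[float]) -> Summary:
    """Summarize one non-empty chunk."""
    data = list(values)
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data)
    return n, mean, m2


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two chunk summaries without revisiting the raw values."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the gap between the two chunk means
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


values = [4.0, 7.0, 13.0, 16.0, 21.0, 25.0, 30.0]
chunks = [values[:3], values[3:5], values[5:]]

total = summarize(chunks[0])
for chunk in chunks[1:]:
    total = merge(total, summarize(chunk))

n, mean, m2 = total
sample_std = (m2 / (n - 1)) ** 0.5
print(n, mean, sample_std)
print(statistics.stdev(values))  # single-pass reference for comparison
```

Since `merge` is associative, the same function can also combine summaries produced by parallel workers, which is why this form suits both chunked and distributed processing.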

## 4. Distributed Computing with Dask

In [None]:
# Initialize Dask client
# client = Client(n_workers=4, threads_per_worker=2, memory_limit='4GB')
# print(client)

def dask_eda_example(file_pattern: str) -> Dict:
    """Perform EDA using Dask for distributed computing."""
    
    # Read large CSV with Dask
    ddf = dd.read_csv(
        file_pattern,
        blocksize='64MB',
        dtype={'user_id': 'int32', 'campaign_id': 'int16'},
        parse_dates=['timestamp']
    )
    
    logger.info(f"Loaded {ddf.npartitions} partitions")
    
    # Describe (lazy computation)
    description = ddf.describe()
    
    # Value counts
    value_counts = ddf['campaign_id'].value_counts()
    
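    # Groupby aggregation
    grouped = ddf.groupby('campaign_id').agg({
        'revenue': ['sum', 'mean', 'std', 'count']
    })

    # Distinct users per campaign via SeriesGroupBy.nunique()
    # (the string 'nunique' inside .agg() is not accepted by every Dask version)
    unique_users = ddf.groupby('campaign_id')['user_id'].nunique()

    # Compute everything in one call so Dask can share the CSV reads across results
    description, value_counts, grouped, unique_users = dask.compute(
        description, value_counts, grouped, unique_users
    )
    results = {
        'description': description,
        'value_counts': value_counts,
        'grouped': grouped,
        'unique_users': unique_users
    }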
    
    return results


def dask_correlation_analysis(ddf: dd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Compute correlation matrix using Dask."""
    
    logger.info("Computing correlation matrix with Dask")
    
    # Select numeric columns
    numeric_ddf = ddf[columns]
    
    # Compute correlation (Dask will parallelize this)
    corr_matrix = numeric_ddf.corr().compute()
    
    return corr_matrix


def dask_time_series_analysis(ddf: dd.DataFrame, date_col: str, value_col: str) -> pd.DataFrame:
    """Perform time series aggregation with Dask."""
    
    # Set index
    ddf = ddf.set_index(date_col)
    
    # Resample by day
    daily = ddf[value_col].resample('D').agg(['sum', 'mean', 'count'])
    
    # Compute
    return daily.compute()


print("Dask EDA functions defined")

## 5. Advanced Outlier Detection at Scale

In [None]:
class OutlierDetector:
    """Advanced outlier detection for large datasets."""
    
    def __init__(self, method: str = 'iqr'):
        self.method = method
        self.logger = logging.getLogger(self.__class__.__name__)
    
    def detect_univariate(self, df: pd.DataFrame, column: str) -> pd.Series:
        """Detect univariate outliers."""
        
        if self.method == 'iqr':
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            return (df[column] < lower_bound) | (df[column] > upper_bound)
        
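        elif self.method == 'zscore':
            # Computed with pandas so the result stays aligned with df's index (NaN rows -> False)
            col = df[column]
            z_scores = (col - col.mean()) / col.std(ddof=0)
            return z_scores.abs() > 3

        elif self.method == 'modified_zscore':
            median = df[column].median()
            # c=1 returns the raw MAD; the default already divides by 0.6745,
            # which would apply the scaling constant twice
            mad = robust.mad(df[column].dropna(), c=1)
            modified_z = 0.6745 * (df[column] - median) / mad
            return np.abs(modified_z) > 3.5

        raise ValueError(f"Unknown method: {self.method!r}")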
    
    def detect_multivariate(self, df: pd.DataFrame, columns: List[str], 
                          contamination: float = 0.1) -> np.ndarray:
        """Detect multivariate outliers using Isolation Forest."""
        
        self.logger.info("Detecting multivariate outliers with Isolation Forest")
        
        # Prepare data
        X = df[columns].dropna()
        
        # Fit Isolation Forest
        iso_forest = IsolationForest(
            contamination=contamination,
            random_state=42,
            n_jobs=-1
        )
        
        outliers = iso_forest.fit_predict(X)
        
        # -1 for outliers, 1 for inliers
        return outliers == -1
    
    def detect_mahalanobis(self, df: pd.DataFrame, columns: List[str]) -> np.ndarray:
        """Detect outliers using Mahalanobis distance."""
        
        self.logger.info("Detecting outliers with Mahalanobis distance")
        
        X = df[columns].dropna()
        
        # Robust covariance estimation
        robust_cov = EllipticEnvelope(contamination=0.1, random_state=42)
        outliers = robust_cov.fit_predict(X)
        
        return outliers == -1
    
    def detect_chunked(self, file_path: str, column: str, chunk_size: int = 100000) -> List:
        """Detect outliers in chunks for very large files."""
        
        self.logger.info(f"Detecting outliers in {column} using chunked processing")
        
        # First pass: collect the non-null values of this one column as NumPy arrays
        # (far smaller than a Python list, but still one float per row in memory)
        parts = []
        for chunk in pd.read_csv(file_path, usecols=[column], chunksize=chunk_size):
            parts.append(chunk[column].dropna().to_numpy())
        values = np.concatenate(parts)

        # Calculate bounds
        Q1, Q3 = np.percentile(values, [25, 75])
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Second pass: flag outliers. Chunks from read_csv keep a running row index,
        # so the index values are already global row positions and need no offset.
        outlier_indices = []

        for chunk in pd.read_csv(file_path, usecols=[column], chunksize=chunk_size):
            chunk_outliers = (chunk[column] < lower_bound) | (chunk[column] > upper_bound)
            outlier_indices.extend(chunk.index[chunk_outliers].tolist())

        return outlier_indices


# Example: Detect outliers
detector = OutlierDetector(method='iqr')
print("Outlier detector initialized")
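
The difference between the z-score and the modified z-score matters most when a single extreme value inflates the mean and standard deviation. The sketch below uses only the standard library and made-up values to show this masking effect. It uses the same thresholds as `OutlierDetector` (3 and 3.5), and the function names are illustrative.

```python
import statistics
from typing import List


def zscore_flags(values: List[float], threshold: float = 3.0) -> List[bool]:
    """Flag values whose |z| exceeds the threshold (population std, ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(x - mean) / std > threshold for x in values]


def modified_zscore_flags(values: List[float], threshold: float = 3.5) -> List[bool]:
    """Flag values using the median and the raw (unscaled) MAD."""
    median = statistics.median(values)
    mad = statistics.median([abs(x - median) for x in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [abs(0.6745 * (x - median) / mad) > threshold for x in values]


# Illustrative data: eight ordinary values and one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 500]

z_flags = zscore_flags(data)
mod_flags = modified_zscore_flags(data)
print("z-score flags:         ", z_flags)
print("modified z-score flags:", mod_flags)
```

With only nine points, the extreme value inflates the standard deviation enough that its own z-score stays below 3, so the plain z-score flags nothing. The median-based version still flags it, which is the usual argument for robust methods on skewed marketing metrics such as revenue.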

## 6. Correlation Analysis at Scale

In [None]:
class ScalableCorrelation:
    """Compute correlations for large datasets."""
    
    def __init__(self):
        self.logger = logging.getLogger(self.__class__.__name__)
    
    def incremental_correlation(self, file_path: str, columns: List[str], 
                               chunk_size: int = 100000) -> pd.DataFrame:
        """Calculate correlation incrementally."""
        
        self.logger.info("Computing incremental correlation")
        
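        n = 0
        shift = None    # first-chunk means, used only as a fixed shift for numerical stability
        sum_d = None    # running sum of (x - shift)
        sum_dd = None   # running sum of the cross-products (x - shift)(x - shift)^T

        for chunk in pd.read_csv(file_path, usecols=columns, chunksize=chunk_size):
            # Listwise deletion: rows with any missing value in `columns` are dropped
            chunk_data = chunk[columns].dropna()
            if chunk_data.empty:
                continue

            if shift is None:
                shift = chunk_data.mean().to_numpy()
                sum_d = np.zeros(len(columns))
                sum_dd = np.zeros((len(columns), len(columns)))

            # Accumulate sufficient statistics; the true means are only known at the end
            d = chunk_data.to_numpy(dtype=float) - shift
            sum_d += d.sum(axis=0)
            sum_dd += d.T @ d
            n += len(d)

        # Shifted-data formula: cov = E[d d^T] - E[d] E[d]^T, then convert to correlation
        mean_d = sum_d / n
        cov_matrix = sum_dd / n - np.outer(mean_d, mean_d)
        std = np.sqrt(np.diag(cov_matrix))
        corr_matrix = cov_matrix / np.outer(std, std)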
        
        return pd.DataFrame(corr_matrix, index=columns, columns=columns)
    
    def sparse_correlation(self, df: pd.DataFrame, threshold: float = 0.3) -> pd.DataFrame:
        """Find only significant correlations above threshold."""
        
        corr = df.corr(numeric_only=True)

        # Collect each pair once (upper triangle), which also skips the diagonal.
        # Perfectly correlated pairs of distinct variables are kept.
        significant_pairs = []
        for i in range(len(corr.columns)):
            for j in range(i + 1, len(corr.columns)):
                value = corr.iloc[i, j]
                if abs(value) >= threshold:
                    significant_pairs.append({
                        'var1': corr.columns[i],
                        'var2': corr.columns[j],
                        'correlation': value
                    })

        # Explicit columns keep sort_values working when no pair passes the threshold
        result = pd.DataFrame(significant_pairs, columns=['var1', 'var2', 'correlation'])
        return result.sort_values('correlation', key=abs, ascending=False)
    
    def visualize_correlation_heatmap(self, corr_matrix: pd.DataFrame, 
                                     title: str = 'Correlation Matrix'):
        """Create interactive correlation heatmap."""
        
        fig = go.Figure(data=go.Heatmap(
            z=corr_matrix.values,
            x=corr_matrix.columns,
            y=corr_matrix.columns,
            colorscale='RdBu',
            zmid=0,
            text=corr_matrix.values,
            texttemplate='%{text:.2f}',
            textfont={"size": 10}
        ))
        
        fig.update_layout(
            title=title,
            width=800,
            height=800
        )
        
        return fig


corr_analyzer = ScalableCorrelation()
print("Correlation analyzer ready")
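
A correlation can be computed in a single pass because it depends only on a handful of running sums. The sketch below accumulates those sums for one pair of variables across chunks, using only the standard library. The class name `StreamingPearson` and the data are illustrative. Raw sums are used here for brevity; very large or offset values benefit from shifting the data first to limit rounding error.

```python
import math
from typing import Iterable, Tuple


class StreamingPearson:
    """Accumulate sufficient statistics for the Pearson correlation of (x, y) pairs."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = 0.0
        self.sy = 0.0
        self.sxx = 0.0
        self.syy = 0.0
        self.sxy = 0.0

    def update(self, pairs: Iterable[Tuple[float, float]]) -> None:
        """Fold one chunk of (x, y) pairs into the running sums."""
        for x, y in pairs:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y

    def correlation(self) -> float:
        """Return Pearson's r from the accumulated sums."""
        n = self.n
        cov = self.sxy - self.sx * self.sy / n
        var_x = self.sxx - self.sx ** 2 / n
        var_y = self.syy - self.sy ** 2 / n
        return cov / math.sqrt(var_x * var_y)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2]

acc = StreamingPearson()
for start in range(0, len(xs), 3):  # feed the data in chunks of three rows
    acc.update(zip(xs[start:start + 3], ys[start:start + 3]))

r = acc.correlation()
print(f"r = {r:.6f}")
```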

## 7. Automated Data Profiling Pipelines

In [None]:
class AutomatedProfiler:
    """Automated data profiling for large datasets."""
    
    def __init__(self):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.profile_results = {}
    
    def profile_table(self, df: pd.DataFrame, sample_size: int = 100000) -> Dict:
        """Generate comprehensive profile of a table."""
        
        self.logger.info("Profiling dataset")
        
        # Sample if too large
        if len(df) > sample_size:
            df_sample = df.sample(n=sample_size, random_state=42)
        else:
            df_sample = df
        
        profile = {
            'shape': df.shape,
            'memory_mb': df.memory_usage(deep=True).sum() / 1024**2,
            'dtypes': df.dtypes.value_counts().to_dict(),
            'missing': df.isnull().sum().to_dict(),
            'missing_pct': (df.isnull().sum() / len(df) * 100).to_dict(),
            'duplicates': df.duplicated().sum(),
            'columns': {}
        }
        
        # Profile each column
        for col in df.columns:
            profile['columns'][col] = self._profile_column(df_sample, col)
        
        return profile
    
    def _profile_column(self, df: pd.DataFrame, column: str) -> Dict:
        """Profile individual column."""
        
        col_profile = {
            'dtype': str(df[column].dtype),
            'missing': df[column].isnull().sum(),
            'missing_pct': df[column].isnull().sum() / len(df) * 100,
            'unique': df[column].nunique(),
            'unique_pct': df[column].nunique() / len(df) * 100
        }
        
        # Numeric columns
        if pd.api.types.is_numeric_dtype(df[column]):
            col_profile.update({
                'mean': df[column].mean(),
                'std': df[column].std(),
                'min': df[column].min(),
                'q25': df[column].quantile(0.25),
                'median': df[column].median(),
                'q75': df[column].quantile(0.75),
                'max': df[column].max(),
                'skew': df[column].skew(),
                'kurtosis': df[column].kurtosis(),
                'zeros': (df[column] == 0).sum(),
                'negative': (df[column] < 0).sum()
            })
        
        # Categorical columns
        elif pd.api.types.is_object_dtype(df[column]):
            value_counts = df[column].value_counts()
            col_profile.update({
                'top_values': value_counts.head(10).to_dict(),
                'cardinality': len(value_counts),
                'mode': df[column].mode()[0] if len(df[column].mode()) > 0 else None
            })
        
        return col_profile
    
    def data_quality_report(self, df: pd.DataFrame) -> pd.DataFrame:
        """Generate data quality report."""
        
        quality_metrics = []
        
        for col in df.columns:
            metrics = {
                'column': col,
                'dtype': str(df[col].dtype),
                'completeness': (1 - df[col].isnull().sum() / len(df)) * 100,
                'uniqueness': df[col].nunique() / len(df) * 100,
                'validity': self._check_validity(df, col)
            }
            quality_metrics.append(metrics)
        
        return pd.DataFrame(quality_metrics)
    
    def _check_validity(self, df: pd.DataFrame, column: str) -> float:
        """Check data validity."""
        
        total = len(df)
        valid = total - df[column].isnull().sum()
        
        # Additional checks for numeric columns
        if pd.api.types.is_numeric_dtype(df[column]):
            # Check for inf values
            valid -= np.isinf(df[column]).sum()
        
        return (valid / total) * 100
    
    def generate_html_report(self, profile: Dict, output_file: str):
        """Generate HTML report from profile."""
        
        html = f"""
        <html>
        <head><title>Data Profile Report</title></head>
        <body>
        <h1>Data Profile Report</h1>
        <h2>Overview</h2>
        <p>Shape: {profile['shape']}</p>
        <p>Memory: {profile['memory_mb']:.2f} MB</p>
        <p>Duplicates: {profile['duplicates']}</p>
        </body>
        </html>
        """
        
        with open(output_file, 'w') as f:
            f.write(html)
        
        self.logger.info(f"Report saved to {output_file}")


profiler = AutomatedProfiler()
print("Automated profiler ready")

## 8. Sampling Strategies for Large Datasets

In [None]:
class SamplingStrategies:
    """Advanced sampling techniques for large datasets."""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
        self.logger = logging.getLogger(self.__class__.__name__)
    
    def simple_random_sample(self, df: pd.DataFrame, n: int) -> pd.DataFrame:
        """Simple random sampling."""
        return df.sample(n=n, random_state=self.random_state)
    
    def stratified_sample(self, df: pd.DataFrame, strata_col: str, 
                         n_per_stratum: int) -> pd.DataFrame:
        """Stratified random sampling."""
        
        self.logger.info(f"Stratified sampling on {strata_col}")
        
        samples = []
        for stratum in df[strata_col].unique():
            stratum_df = df[df[strata_col] == stratum]
            sample_size = min(n_per_stratum, len(stratum_df))
            samples.append(stratum_df.sample(n=sample_size, random_state=self.random_state))
        
        return pd.concat(samples, ignore_index=True)
    
    def systematic_sample(self, df: pd.DataFrame, k: int) -> pd.DataFrame:
        """Systematic sampling (every kth row)."""
        return df.iloc[::k]
    
    def reservoir_sample(self, file_path: str, n: int, 
                        chunk_size: int = 100000) -> pd.DataFrame:
        """Reservoir sampling for streaming data."""
        
        self.logger.info(f"Reservoir sampling {n} rows")
        
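        # Seeded generator so the sample is reproducible via self.random_state
        rng = np.random.default_rng(self.random_state)
        reservoir = []
        count = 0

        for chunk in pd.read_csv(file_path, chunksize=chunk_size):
            for _, row in chunk.iterrows():
                count += 1

                if len(reservoir) < n:
                    reservoir.append(row)
                else:
                    # Algorithm R: the new row replaces a slot with probability n / count
                    j = rng.integers(0, count)
                    if j < n:
                        reservoir[j] = row

        return pd.DataFrame(reservoir)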
    
    def time_based_sample(self, df: pd.DataFrame, date_col: str, 
                         freq: str = 'W') -> pd.DataFrame:
        """Time-based sampling."""
        
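        # Convert the date column on a new DataFrame so the caller's data is not modified
        dated = df.assign(**{date_col: pd.to_datetime(df[date_col])})
        return dated.set_index(date_col).resample(freq).first().reset_index()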
    
    def evaluate_sample_quality(self, original: pd.DataFrame, 
                               sample: pd.DataFrame,
                               numeric_cols: List[str]) -> pd.DataFrame:
        """Evaluate how well sample represents population."""
        
        comparison = []
        
        for col in numeric_cols:
            comparison.append({
                'column': col,
                'pop_mean': original[col].mean(),
                'sample_mean': sample[col].mean(),
                'mean_diff': abs(original[col].mean() - sample[col].mean()),
                'pop_std': original[col].std(),
                'sample_std': sample[col].std(),
                'std_diff': abs(original[col].std() - sample[col].std())
            })
        
        return pd.DataFrame(comparison)


sampler = SamplingStrategies(random_state=42)
print("Sampling strategies ready")
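
Reservoir sampling is easiest to reason about on a plain iterator, without DataFrames in the way. The sketch below is a minimal standard-library version of Algorithm R. The function name and the `seed` parameter are illustrative choices.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for count, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items are always kept
            reservoir.append(item)
        else:
            # Replace phase: item number `count` is kept with probability k / count
            j = rng.randrange(count)  # uniform integer in [0, count)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5)
print(sample)

# A stream shorter than k is returned in full
print(reservoir_sample(range(3), k=5))
```

The method needs memory for only `k` items and a single pass, which is why it suits files or query results whose row count is not known in advance.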

## 9. Real-World Project: Profile 100M Row Marketing Database

### Comprehensive Marketing Database Profiling System

In [None]:
class MarketingDatabaseProfiler:
    """Profile massive marketing database with 100M+ rows."""
    
    def __init__(self, rs_config: Dict):
        self.rs = RedshiftEDA(rs_config)
        self.profiler = AutomatedProfiler()
        self.sampler = SamplingStrategies()
        self.outlier_detector = OutlierDetector()
        self.logger = logging.getLogger(self.__class__.__name__)
        self.results = {}
    
    def execute_full_profile(self, table: str, date_col: str = 'event_date',
                           start_date: str = '2024-01-01',
                           end_date: str = '2024-12-31') -> Dict:
        """Execute comprehensive profiling pipeline."""
        
        self.logger.info(f"Starting comprehensive profile of {table}")
        start_time = time.time()
        
        # 1. Get table overview
        self.logger.info("Step 1: Table overview")
        overview_query = f"""
        SELECT 
            COUNT(*) as total_rows,
            COUNT(DISTINCT user_id) as unique_users,
            COUNT(DISTINCT campaign_id) as unique_campaigns,
            MIN({date_col}) as min_date,
            MAX({date_col}) as max_date,
            SUM(revenue) as total_revenue
        FROM {table}
        WHERE {date_col} BETWEEN '{start_date}' AND '{end_date}'
        """
        self.results['overview'] = pd.read_sql(overview_query, self.rs.engine).to_dict('records')[0]
        
        # 2. Sample data for detailed analysis
        self.logger.info("Step 2: Sampling data")
        sample_query = f"""
        SELECT *
        FROM {table}
        WHERE {date_col} BETWEEN '{start_date}' AND '{end_date}'
          AND RANDOM() < 1000000.0 / (SELECT COUNT(*) FROM {table})
        LIMIT 1000000
        """
        sample_df = pd.read_sql(sample_query, self.rs.engine)
        self.results['sample_size'] = len(sample_df)
        
        # 3. Profile sample
        self.logger.info("Step 3: Profiling sample")
        self.results['profile'] = self.profiler.profile_table(sample_df)
        
        # 4. Statistical analysis
        self.logger.info("Step 4: Statistical analysis")
        numeric_cols = sample_df.select_dtypes(include=[np.number]).columns.tolist()
        self.results['statistics'] = sample_df[numeric_cols].describe().to_dict()
        
        # 5. Outlier detection
        self.logger.info("Step 5: Outlier detection")
        self.results['outliers'] = {}
        for col in ['revenue', 'clicks', 'impressions']:
            if col in sample_df.columns:
                outliers = self.outlier_detector.detect_univariate(sample_df, col)
                self.results['outliers'][col] = outliers.sum()
        
        # 6. Correlation analysis
        self.logger.info("Step 6: Correlation analysis")
        if len(numeric_cols) > 1:
            self.results['correlations'] = sample_df[numeric_cols].corr().to_dict()
        
        # 7. Data quality metrics
        self.logger.info("Step 7: Data quality assessment")
        self.results['quality'] = self.profiler.data_quality_report(sample_df).to_dict('records')
        
        # 8. Time series analysis
        self.logger.info("Step 8: Time series trends")
        if date_col in sample_df.columns:
            sample_df[date_col] = pd.to_datetime(sample_df[date_col])
            daily_stats = sample_df.groupby(sample_df[date_col].dt.date).agg({
                'revenue': ['sum', 'mean', 'count']
            })
            self.results['daily_trends'] = daily_stats.to_dict()
        
        # 9. Channel analysis
        self.logger.info("Step 9: Channel performance")
        if 'channel' in sample_df.columns:
            channel_stats = sample_df.groupby('channel').agg({
                'revenue': ['sum', 'mean', 'count'],
                'user_id': 'nunique'
            })
            self.results['channel_performance'] = channel_stats.to_dict()
        
        elapsed = time.time() - start_time
        self.results['execution_time'] = elapsed
        
        self.logger.info(f"Profiling complete in {elapsed:.2f}s")
        
        return self.results
    
    def generate_executive_summary(self) -> str:
        """Generate executive summary of findings."""
        
        summary = f"""
        MARKETING DATABASE PROFILE SUMMARY
        ==================================
        
        Total Rows: {self.results['overview']['total_rows']:,}
        Unique Users: {self.results['overview']['unique_users']:,}
        Unique Campaigns: {self.results['overview']['unique_campaigns']:,}
        Date Range: {self.results['overview']['min_date']} to {self.results['overview']['max_date']}
        Total Revenue: ${self.results['overview']['total_revenue']:,.2f}
        
        Sample Size: {self.results['sample_size']:,} rows
        
        Data Quality:
        - Completeness: {np.mean([q['completeness'] for q in self.results['quality']]):.1f}%
        - Outliers Detected: {sum(self.results['outliers'].values()):,}
        
        Execution Time: {self.results['execution_time']:.2f}s
        """
        
        return summary


# Example usage
# profiler = MarketingDatabaseProfiler(REDSHIFT_CONFIG)
# results = profiler.execute_full_profile('marketing_events')
# print(profiler.generate_executive_summary())

print("Marketing database profiler ready")
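
The sampling query in Step 2 computes its probability from the row count of the whole table, while the date filter narrows the rows that are actually eligible. As a rough sanity check, the expected sample size can be estimated by treating each eligible row as an independent coin flip. The row counts below are hypothetical, and `expected_sample_rows` is an illustrative helper.

```python
import math
from typing import Tuple


def expected_sample_rows(total_rows: int, rows_in_window: int,
                         target_rows: int) -> Tuple[float, float]:
    """Expected size and standard deviation of a RANDOM() < p sample.

    p is derived from the whole table, but only rows_in_window rows pass
    the date filter, so the sample follows Binomial(rows_in_window, p).
    """
    p = min(1.0, target_rows / total_rows)
    expected = rows_in_window * p
    std = math.sqrt(rows_in_window * p * (1 - p))
    return expected, std


# Hypothetical sizes: 100M rows in the table, 25M inside the date range
expected, std = expected_sample_rows(
    total_rows=100_000_000,
    rows_in_window=25_000_000,
    target_rows=1_000_000,
)
print(f"expected rows: {expected:,.0f} (std {std:,.0f})")
```

Under these assumptions the sample holds roughly a quarter of the 1,000,000 target rows. Counting only the rows inside the date range in the denominator would bring the expected size back to the target.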

## 10. Exercises

### Exercise 1: Large-Scale Statistical Analysis
Using a 10M+ row dataset:
1. Calculate comprehensive statistics without loading the entire dataset into memory
2. Identify the distributions of key metrics
3. Detect outliers using multiple methods
4. Generate a statistical report

### Exercise 2: Distributed EDA with Dask
1. Load a large dataset using Dask
2. Perform groupby aggregations
3. Compute correlation matrix
4. Compare performance with pandas

### Exercise 3: Automated Profiling Pipeline
Build an automated profiling system that:
1. Profiles all tables in a Redshift schema
2. Generates quality metrics
3. Identifies data issues
4. Creates HTML reports
5. Sends alerts for quality issues

### Exercise 4: Sampling Strategy Comparison
1. Implement 3+ sampling strategies
2. Compare sample quality metrics
3. Measure bias in each method
4. Recommend best strategy for different scenarios

## Resources

### Documentation
- [Dask Documentation](https://docs.dask.org/)
- [Pandas Profiling](https://pandas-profiling.github.io/)
- [SciPy Statistics](https://docs.scipy.org/doc/scipy/reference/stats.html)
- [Scikit-learn Outlier Detection](https://scikit-learn.org/stable/modules/outlier_detection.html)

### Papers
- Scalable Data Profiling - IEEE
- Outlier Detection at Scale - KDD
- Sampling Techniques for Big Data - ACM

### Tools
- Great Expectations: Data quality framework
- Pandas Profiling: Automated EDA
- Dask: Distributed computing
- PyOD: Outlier detection library