# Week 4: Advanced SQL Analytics for Redshift

## Learning Objectives
- Implement advanced analytics with Redshift SQL
- Create and use Python UDFs in Redshift
- Master recursive CTEs for complex hierarchies
- Perform graph analytics for attribution paths
- Analyze time-series data at scale
- Build advanced aggregations and pivots
- Implement data quality checks at scale
- Optimize multi-touch attribution queries

## Prerequisites
```bash
pip install pandas psycopg2-binary sqlalchemy numpy
```

## 1. Setup and Configuration

In [None]:
import os
import pandas as pd
import numpy as np
import psycopg2
from sqlalchemy import create_engine, text
import logging
import time
from typing import Dict, List, Tuple
import json

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Redshift configuration
REDSHIFT_CONFIG = {
    'host': os.getenv('REDSHIFT_HOST', 'your-cluster.region.redshift.amazonaws.com'),
    'port': int(os.getenv('REDSHIFT_PORT', 5439)),
    'database': os.getenv('REDSHIFT_DB', 'marketing'),
    'user': os.getenv('REDSHIFT_USER', 'analyst'),
    'password': os.getenv('REDSHIFT_PASSWORD', 'your-password')
}

# Create connection
connection_string = (
    f"postgresql+psycopg2://{REDSHIFT_CONFIG['user']}:{REDSHIFT_CONFIG['password']}"
    f"@{REDSHIFT_CONFIG['host']}:{REDSHIFT_CONFIG['port']}/{REDSHIFT_CONFIG['database']}"
)

engine = create_engine(connection_string, pool_pre_ping=True)

def execute_query(query: str, params: Dict = None) -> pd.DataFrame:
    """Execute query and return results as DataFrame."""
    start_time = time.time()
    
    with engine.connect() as conn:
        result = conn.execute(text(query), params or {})
        df = pd.DataFrame(result.fetchall(), columns=result.keys())
        
    elapsed = time.time() - start_time
    logger.info(f"Query returned {len(df):,} rows in {elapsed:.2f}s")
    
    return df

print("✓ Redshift engine configured (connections open lazily on first query)")

## 2. User-Defined Functions (UDFs)

### 2.1 Python UDFs in Redshift

In [None]:
# Example 1: Scalar Python UDF for advanced string manipulation
create_udf_extract_domain = """
CREATE OR REPLACE FUNCTION f_extract_domain(email VARCHAR)
RETURNS VARCHAR
IMMUTABLE
AS $$
    if email and '@' in email:
        return email.split('@')[1].lower()
    return None
$$ LANGUAGE plpythonu;
"""

# Example 2: UDF for JSON parsing
create_udf_parse_json = """
CREATE OR REPLACE FUNCTION f_parse_json_field(json_str VARCHAR, field_name VARCHAR)
RETURNS VARCHAR
STABLE
AS $$
    import json
    try:
        data = json.loads(json_str)
        return str(data.get(field_name, ''))
    except (ValueError, TypeError, AttributeError):
        return None
$$ LANGUAGE plpythonu;
"""

# Example 3: UDF for custom scoring logic
create_udf_lead_score = """
CREATE OR REPLACE FUNCTION f_calculate_lead_score(
    page_views INTEGER,
    email_opens INTEGER,
    downloads INTEGER,
    recency_days INTEGER
)
RETURNS INTEGER
IMMUTABLE
AS $$
    # Scoring logic
    score = 0
    
    # Page views (max 30 points)
    score += min(page_views * 2, 30)
    
    # Email engagement (max 25 points)
    score += min(email_opens * 5, 25)
    
    # Downloads (max 30 points)
    score += min(downloads * 10, 30)
    
    # Recency bonus (max 15 points)
    if recency_days <= 7:
        score += 15
    elif recency_days <= 30:
        score += 10
    elif recency_days <= 90:
        score += 5
    
    return min(score, 100)
$$ LANGUAGE plpythonu;
"""

# Usage example
use_udf_example = """
SELECT 
    user_id,
    email,
    f_extract_domain(email) as email_domain,
    page_views,
    email_opens,
    downloads,
    recency_days,
    f_calculate_lead_score(page_views, email_opens, downloads, recency_days) as lead_score
FROM user_engagement
WHERE f_calculate_lead_score(page_views, email_opens, downloads, recency_days) >= 70;
"""

print("""
Python UDF Best Practices:
-------------------------
1. Use IMMUTABLE when the function always returns the same result for the same inputs
2. Use STABLE when the result is fixed for given inputs within a single statement but may differ between statements
3. Use VOLATILE when the result can change between calls, even within one statement (re-evaluated for every row, so least optimizable)
4. Keep UDFs simple - complex logic belongs in the application layer
5. Test UDFs thoroughly with edge cases, including NULL inputs
6. Document UDF behavior and expected inputs
""")
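
One way to follow the "test with edge cases" advice is to keep a plain-Python mirror of the UDF bodies and exercise it locally before running `CREATE FUNCTION`. The sketch below copies the logic of `f_extract_domain` and `f_calculate_lead_score`; the function names and the handling of `None` (treating NULL inputs as zero activity and no recency bonus) are assumptions for illustration, not behavior defined by the UDFs above.

```python
def extract_domain(email):
    """Local mirror of f_extract_domain: lowercase domain, or None."""
    if email and '@' in email:
        return email.split('@')[1].lower()
    return None


def calculate_lead_score(page_views, email_opens, downloads, recency_days):
    """Local mirror of f_calculate_lead_score.

    Assumption (not in the UDF above): NULL inputs arrive as None and are
    treated as zero activity / no recency bonus.
    """
    page_views = page_views or 0
    email_opens = email_opens or 0
    downloads = downloads or 0

    score = min(page_views * 2, 30)      # page views, max 30 points
    score += min(email_opens * 5, 25)    # email engagement, max 25 points
    score += min(downloads * 10, 30)     # downloads, max 30 points

    if recency_days is not None:         # recency bonus, max 15 points
        if recency_days <= 7:
            score += 15
        elif recency_days <= 30:
            score += 10
        elif recency_days <= 90:
            score += 5

    return min(score, 100)


# Edge cases worth checking before deploying the SQL version
edge_cases = [
    (20, 10, 5, 3),            # every component capped
    (1, 1, 1, 45),             # small values, middle recency band
    (0, 0, 0, 7),              # recency boundary
    (None, None, None, None),  # all NULL inputs
]
for args in edge_cases:
    print(args, calculate_lead_score(*args))
```

If the SQL UDF should handle NULLs the same way, the equivalent guards would need to be added to its body as well.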

## 3. Recursive CTEs for Hierarchical Data

In [None]:
# Example 1: User referral hierarchy
recursive_referrals = """
WITH RECURSIVE referral_chain AS (
    -- Base case: users with no referrer (level 0)
    SELECT 
        user_id,
        referred_by_user_id,
        user_id as root_user_id,
        0 as level,
        CAST(user_id AS VARCHAR(1000)) as path  -- explicit length leaves room for the path to grow
    FROM users
    WHERE referred_by_user_id IS NULL
    
    UNION ALL
    
    -- Recursive case: users referred by previous level
    SELECT 
        u.user_id,
        u.referred_by_user_id,
        rc.root_user_id,
        rc.level + 1,
        rc.path || ' -> ' || CAST(u.user_id AS VARCHAR)
    FROM users u
    INNER JOIN referral_chain rc ON u.referred_by_user_id = rc.user_id
    WHERE rc.level < 10  -- Prevent infinite loops
)
SELECT 
    root_user_id,
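    COUNT(*) - 1 as total_referrals,  -- exclude the root user's own level-0 row
    MAX(level) as max_depth,
    AVG(CASE WHEN level > 0 THEN level * 1.0 END) as avg_depth  -- referred users only; * 1.0 avoids integer averaging
FROM referral_chain
GROUP BY root_user_id
HAVING COUNT(*) - 1 > 10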
ORDER BY total_referrals DESC;
"""

# Example 2: Campaign hierarchy (parent-child campaigns)
recursive_campaign_hierarchy = """
WITH RECURSIVE campaign_tree AS (
    -- Root campaigns
    SELECT 
        campaign_id,
        parent_campaign_id,
        campaign_name,
        budget,
        0 as level,
        CAST(campaign_name AS VARCHAR(1000)) as hierarchy_path
    FROM campaigns
    WHERE parent_campaign_id IS NULL
    
    UNION ALL
    
    -- Child campaigns
    SELECT 
        c.campaign_id,
        c.parent_campaign_id,
        c.campaign_name,
        c.budget,
        ct.level + 1,
        ct.hierarchy_path || ' > ' || c.campaign_name
    FROM campaigns c
    INNER JOIN campaign_tree ct ON c.parent_campaign_id = ct.campaign_id
    WHERE ct.level < 5
)
SELECT 
    campaign_id,
    campaign_name,
    level,
    hierarchy_path,
    budget,
    SUM(budget) OVER (PARTITION BY SPLIT_PART(hierarchy_path, ' > ', 1)) as root_total_budget
FROM campaign_tree
ORDER BY hierarchy_path;
"""

# Example 3: Event sequence chains
recursive_event_sequences = """
WITH RECURSIVE ordered_events AS (
    -- Number each user's events so the chain can step to the next event only
    SELECT 
        user_id,
        event_id,
        event_type,
        channel,
        timestamp,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp) as rn
    FROM marketing_events
    WHERE timestamp >= '2024-01-01'
),
chain AS (
    -- Base case: first event for each user
    SELECT 
        user_id,
        event_id,
        event_type,
        channel,
        timestamp,
        rn,
        1 as step_number,
        CAST(event_type AS VARCHAR(2000)) as event_path
    FROM ordered_events
    WHERE rn = 1
    
    UNION ALL
    
    -- Recursive case: append only the immediately following event.
    -- Joining on every later event would enumerate all subsequences.
    SELECT 
        e.user_id,
        e.event_id,
        e.event_type,
        e.channel,
        e.timestamp,
        e.rn,
        c.step_number + 1,
        c.event_path || ' -> ' || e.event_type
    FROM ordered_events e
    INNER JOIN chain c 
        ON e.user_id = c.user_id 
        AND e.rn = c.rn + 1
    WHERE c.step_number < 20
      AND e.timestamp <= c.timestamp + INTERVAL '7 days'
)
SELECT 
    event_path,
    COUNT(DISTINCT user_id) as user_count,
    AVG(step_number * 1.0) as avg_steps
FROM chain
WHERE event_path LIKE '%conversion'
GROUP BY event_path
ORDER BY user_count DESC
LIMIT 100;
"""

print("""
Recursive CTE Best Practices:
----------------------------
1. Always include termination condition (level < N)
2. Monitor query performance - recursion can be expensive
3. Choose distribution and sort keys that match the join columns (Redshift has no conventional indexes)
4. Consider materialized views for frequently-used hierarchies
5. Test with realistic data volumes
""")
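
To sanity-check a recursive CTE on a small sample, it can help to reproduce the traversal in plain Python and compare the results row by row. The sketch below walks a referral hierarchy breadth-first with the same depth cap as the `referral_chain` query; the `referral_levels` function and the sample mapping are hypothetical and only meant for cross-checking.

```python
from collections import defaultdict, deque


def referral_levels(referred_by, max_depth=10):
    """Return {user_id: (root_user_id, level)} for every reachable user.

    referred_by maps user_id -> referrer user_id (None for root users),
    mirroring the users.referred_by_user_id column.
    """
    children = defaultdict(list)
    roots = []
    for user_id, referrer in referred_by.items():
        if referrer is None:
            roots.append(user_id)
        else:
            children[referrer].append(user_id)

    result = {}
    queue = deque((root, root, 0) for root in roots)
    while queue:
        user_id, root, level = queue.popleft()
        result[user_id] = (root, level)
        # Same termination rule as "WHERE rc.level < 10" in the CTE
        if level < max_depth:
            for child in children[user_id]:
                queue.append((child, root, level + 1))
    return result


# Hypothetical sample: user 1 referred 2 and 3, user 2 referred 4, user 5 referred 6
sample = {1: None, 2: 1, 3: 1, 4: 2, 5: None, 6: 5}
for user_id, (root, level) in sorted(referral_levels(sample).items()):
    print(user_id, root, level)
```

Users caught in a referral cycle with no root are never reached here, which matches the CTE: its base case only starts from rows where `referred_by_user_id IS NULL`.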

## 4. Graph Analytics for Attribution Paths

In [None]:
# Example 1: Build user journey graph
user_journey_graph = """
WITH user_events AS (
    SELECT 
        user_id,
        channel,
        event_type,
        timestamp,
        revenue,
        LAG(channel) OVER (PARTITION BY user_id ORDER BY timestamp) as prev_channel,
        LEAD(channel) OVER (PARTITION BY user_id ORDER BY timestamp) as next_channel,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp) as position
    FROM marketing_events
    WHERE event_type IN ('impression', 'click', 'conversion')
),
channel_transitions AS (
    SELECT 
        prev_channel as from_channel,
        channel as to_channel,
        COUNT(*) as transition_count,
        COUNT(DISTINCT user_id) as user_count,
        SUM(CASE WHEN event_type = 'conversion' THEN revenue ELSE 0 END) as attributed_revenue
    FROM user_events
    WHERE prev_channel IS NOT NULL
    GROUP BY prev_channel, channel
)
SELECT 
    from_channel,
    to_channel,
    transition_count,
    user_count,
    attributed_revenue,
    transition_count * 1.0 / SUM(transition_count) OVER (PARTITION BY from_channel) as transition_probability
FROM channel_transitions
ORDER BY transition_count DESC;
"""

# Example 2: Path-level value analysis (frequency and revenue per conversion path; descriptive only, not a Shapley value computation)
path_value_analysis = """
WITH conversion_paths AS (
    SELECT 
        user_id,
        LISTAGG(channel, ' > ') WITHIN GROUP (ORDER BY timestamp) as full_path,
        COUNT(*) as touchpoints,
        MAX(CASE WHEN event_type = 'conversion' THEN revenue ELSE 0 END) as conversion_value
    FROM marketing_events
    WHERE timestamp >= '2024-01-01'
    GROUP BY user_id
    HAVING MAX(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END) = 1
),
path_stats AS (
    SELECT 
        full_path,
        COUNT(*) as path_frequency,
        SUM(conversion_value) as total_value,
        AVG(conversion_value) as avg_value,
        AVG(touchpoints * 1.0) as avg_touchpoints  -- COUNT(*) is an integer; * 1.0 keeps the fractional part
    FROM conversion_paths
    GROUP BY full_path
)
SELECT 
    full_path,
    path_frequency,
    total_value,
    avg_value,
    avg_touchpoints,
    path_frequency * 1.0 / SUM(path_frequency) OVER () as path_percentage
FROM path_stats
WHERE path_frequency >= 10
ORDER BY total_value DESC
LIMIT 50;
"""

# Example 3: Channel network analysis
channel_network_metrics = """
WITH channel_pairs AS (
    SELECT 
        user_id,
        channel as current_channel,
        LAG(channel, 1) OVER (PARTITION BY user_id ORDER BY timestamp) as prev_1,
        LAG(channel, 2) OVER (PARTITION BY user_id ORDER BY timestamp) as prev_2,
        event_type,
        revenue
    FROM marketing_events
),
channel_influence AS (
    SELECT 
        current_channel,
        prev_1,
        -- Direct influence (immediate predecessor)
        COUNT(*) as direct_count,
        SUM(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END) as direct_conversions,
        SUM(CASE WHEN event_type = 'conversion' THEN revenue ELSE 0 END) as direct_revenue
    FROM channel_pairs
    WHERE prev_1 IS NOT NULL
    GROUP BY current_channel, prev_1
)
SELECT 
    prev_1 as influencing_channel,
    current_channel as influenced_channel,
    direct_count,
    direct_conversions,
    direct_revenue,
    direct_conversions * 1.0 / NULLIF(direct_count, 0) as conversion_rate,
    direct_revenue / NULLIF(direct_conversions, 0) as avg_conversion_value
FROM channel_influence
WHERE direct_count >= 100
ORDER BY direct_revenue DESC;
"""

print("✓ Graph analytics queries defined")
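
The transition-probability logic in `user_journey_graph` (pair each event with the previous channel, count the pairs, then normalize per source channel) can be checked on a handful of rows in plain Python. This is a minimal sketch; the `transition_probabilities` function and the sample events are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(events):
    """Compute P(to_channel | from_channel) from (user_id, timestamp, channel) tuples."""
    by_user = defaultdict(list)
    for user_id, ts, channel in events:
        by_user[user_id].append((ts, channel))

    # Count consecutive channel pairs per user (the LAG(channel) step)
    counts = Counter()
    for touches in by_user.values():
        touches.sort()
        for (_, prev_channel), (_, channel) in zip(touches, touches[1:]):
            counts[(prev_channel, channel)] += 1

    # Normalize by the total outgoing transitions of each source channel
    totals = Counter()
    for (prev_channel, _), n in counts.items():
        totals[prev_channel] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# Hypothetical sample events: (user_id, timestamp, channel)
sample_events = [
    ('u1', 1, 'email'), ('u1', 2, 'search'), ('u1', 3, 'display'),
    ('u2', 1, 'email'), ('u2', 2, 'search'),
    ('u3', 1, 'email'), ('u3', 2, 'display'),
]
for (src, dst), p in sorted(transition_probabilities(sample_events).items()):
    print(f"{src} -> {dst}: {p:.3f}")
```

For each source channel the probabilities sum to 1, which is a quick invariant to assert against the SQL output as well.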

## 5. Time-Series Analysis at Scale

In [None]:
# Example 1: Time-series decomposition (trend, seasonality)
time_series_decomposition = """
WITH daily_metrics AS (
    SELECT 
        DATE_TRUNC('day', timestamp) as date,
        channel,
        SUM(revenue) as daily_revenue
    FROM marketing_events
    WHERE timestamp >= DATEADD(year, -2, GETDATE())
    GROUP BY DATE_TRUNC('day', timestamp), channel
),
metrics_with_trend AS (
    SELECT 
        date,
        channel,
        daily_revenue,
        -- 30-day moving average (trend)
        AVG(daily_revenue) OVER (
            PARTITION BY channel 
            ORDER BY date 
            ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
        ) as trend_30d,
        -- Day of week pattern
        EXTRACT(DOW FROM date) as day_of_week,
        -- Week of year pattern
        EXTRACT(WEEK FROM date) as week_of_year
    FROM daily_metrics
),
seasonality AS (
    SELECT 
        channel,
        day_of_week,
        AVG(daily_revenue / NULLIF(trend_30d, 0)) as dow_seasonality_factor
    FROM metrics_with_trend
    WHERE trend_30d > 0
    GROUP BY channel, day_of_week
)
SELECT 
    m.date,
    m.channel,
    m.daily_revenue as actual,
    m.trend_30d as trend,
    s.dow_seasonality_factor as seasonality,
    m.daily_revenue - m.trend_30d as detrended,
    m.daily_revenue - (m.trend_30d * s.dow_seasonality_factor) as residual
FROM metrics_with_trend m
LEFT JOIN seasonality s ON m.channel = s.channel AND m.day_of_week = s.day_of_week
WHERE m.date >= DATEADD(day, -90, GETDATE())
ORDER BY m.channel, m.date;
"""

# Example 2: Forecasting with linear regression
revenue_forecast = """
WITH daily_revenue AS (
    SELECT 
        DATE_TRUNC('day', timestamp) as date,
        channel,
        SUM(revenue) as revenue,
        ROW_NUMBER() OVER (PARTITION BY channel ORDER BY DATE_TRUNC('day', timestamp)) as day_number
    FROM marketing_events
    WHERE timestamp >= DATEADD(day, -90, GETDATE())
    GROUP BY DATE_TRUNC('day', timestamp), channel
),
regression_stats AS (
    -- Per-channel sums for closed-form ordinary least squares.
    -- (Window functions cannot be nested inside aggregates, so the
    -- centered-sum form is replaced by raw sums.)
    SELECT 
        channel,
        COUNT(*) as n,
        SUM(CAST(day_number AS FLOAT)) as sum_x,
        SUM(CAST(revenue AS FLOAT)) as sum_y,
        SUM(CAST(day_number AS FLOAT) * CAST(revenue AS FLOAT)) as sum_xy,
        SUM(CAST(day_number AS FLOAT) * CAST(day_number AS FLOAT)) as sum_xx
    FROM daily_revenue
    GROUP BY channel
),
coefficients AS (
    SELECT 
        channel,
        n,
        sum_x,
        sum_y,
        -- slope = (n * Sxy - Sx * Sy) / (n * Sxx - Sx^2)
        (n * sum_xy - sum_x * sum_y) / NULLIF(n * sum_xx - sum_x * sum_x, 0) as slope
    FROM regression_stats
)
SELECT 
    c.channel,
    c.slope,
    -- intercept = mean(y) - slope * mean(x)
    (c.sum_y - c.slope * c.sum_x) / c.n as intercept,
    -- Forecast 1 and 7 days past the last observed day (day_number = n)
    (c.sum_y - c.slope * c.sum_x) / c.n + c.slope * (c.n + 1) as forecast_next_day,
    (c.sum_y - c.slope * c.sum_x) / c.n + c.slope * (c.n + 7) as forecast_day_7,
    CASE 
        WHEN c.slope > 0 THEN 'Growing'
        WHEN c.slope < 0 THEN 'Declining'
        ELSE 'Stable'
    END as trend_direction
FROM coefficients c;
"""

# Example 3: Change point detection
change_point_detection = """
WITH daily_metrics AS (
    SELECT 
        DATE_TRUNC('day', timestamp) as date,
        channel,
        SUM(revenue) as revenue,
        COUNT(*) as event_count
    FROM marketing_events
    WHERE timestamp >= DATEADD(day, -60, GETDATE())
    GROUP BY DATE_TRUNC('day', timestamp), channel
),
rolling_stats AS (
    SELECT 
        date,
        channel,
        revenue,
        AVG(revenue) OVER (
            PARTITION BY channel 
            ORDER BY date 
            ROWS BETWEEN 13 PRECEDING AND CURRENT ROW
        ) as ma_14d,
        STDDEV(revenue) OVER (
            PARTITION BY channel 
            ORDER BY date 
            ROWS BETWEEN 13 PRECEDING AND CURRENT ROW
        ) as stddev_14d
    FROM daily_metrics
)
SELECT 
    date,
    channel,
    revenue,
    ma_14d,
    stddev_14d,
    (revenue - ma_14d) / NULLIF(stddev_14d, 0) as z_score,
    CASE 
        WHEN ABS((revenue - ma_14d) / NULLIF(stddev_14d, 0)) > 3 THEN 'Significant Change'
        WHEN ABS((revenue - ma_14d) / NULLIF(stddev_14d, 0)) > 2 THEN 'Notable Change'
        ELSE 'Normal'
    END as change_flag
FROM rolling_stats
WHERE date >= DATEADD(day, -30, GETDATE())
ORDER BY channel, date;
"""

# Example 4: Exponential smoothing (two-point approximation)
exponential_smoothing = """
WITH daily_data AS (
    SELECT 
        DATE_TRUNC('day', timestamp) as date,
        channel,
        SUM(revenue) as revenue,
        ROW_NUMBER() OVER (PARTITION BY channel ORDER BY DATE_TRUNC('day', timestamp)) as rn
    FROM marketing_events
    WHERE timestamp >= DATEADD(day, -90, GETDATE())
    GROUP BY DATE_TRUNC('day', timestamp), channel
),
smoothed AS (
    SELECT 
        date,
        channel,
        revenue,
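        -- Two-point weighted average with alpha=0.3. This approximates simple exponential
        -- smoothing, which would feed back the previous smoothed value rather than the previous raw value.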
        revenue as current_value,
        LAG(revenue) OVER (PARTITION BY channel ORDER BY date) as prev_value,
        0.3 * revenue + 0.7 * LAG(revenue) OVER (PARTITION BY channel ORDER BY date) as smoothed_value
    FROM daily_data
)
SELECT 
    date,
    channel,
    revenue as actual,
    smoothed_value as smoothed,
    revenue - smoothed_value as residual,
    POWER(revenue - smoothed_value, 2) as squared_error
FROM smoothed
WHERE smoothed_value IS NOT NULL
ORDER BY channel, date;
"""

print("✓ Time-series analysis queries defined")
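
The smoothing query above blends each day's revenue with the previous day's raw revenue. Classical simple exponential smoothing is recursive: each smoothed value depends on the previous smoothed value, `s_t = alpha * x_t + (1 - alpha) * s_(t-1)`. A minimal Python sketch of that recursion follows, assuming the series is seeded with its first observation (a common convention, but an assumption here); the function name is hypothetical.

```python
def exponential_smoothing(values, alpha=0.3):
    """Simple exponential smoothing, seeded with the first observation."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # seed: s_1 = x_1
        else:
            # s_t = alpha * x_t + (1 - alpha) * s_(t-1)
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


daily_revenue = [100.0, 120.0, 90.0, 150.0, 130.0]  # hypothetical sample
for actual, smooth in zip(daily_revenue, exponential_smoothing(daily_revenue)):
    print(f"actual={actual:.1f} smoothed={smooth:.2f} residual={actual - smooth:.2f}")
```

Running this on a per-channel series pulled with `execute_query` is usually simpler than expressing the recursion in SQL, since each value depends on the entire history.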

## 6. Advanced Aggregations and Pivots

In [None]:
# Example 1: Static pivot using CASE expressions (channel names are hard-coded)
pivot_revenue_by_channel = """
SELECT 
    DATE_TRUNC('month', timestamp) as month,
    SUM(CASE WHEN channel = 'email' THEN revenue ELSE 0 END) as email_revenue,
    SUM(CASE WHEN channel = 'social' THEN revenue ELSE 0 END) as social_revenue,
    SUM(CASE WHEN channel = 'search' THEN revenue ELSE 0 END) as search_revenue,
    SUM(CASE WHEN channel = 'display' THEN revenue ELSE 0 END) as display_revenue,
    SUM(revenue) as total_revenue,
    -- Calculate percentages
    SUM(CASE WHEN channel = 'email' THEN revenue ELSE 0 END) * 100.0 / NULLIF(SUM(revenue), 0) as email_pct,
    SUM(CASE WHEN channel = 'social' THEN revenue ELSE 0 END) * 100.0 / NULLIF(SUM(revenue), 0) as social_pct,
    SUM(CASE WHEN channel = 'search' THEN revenue ELSE 0 END) * 100.0 / NULLIF(SUM(revenue), 0) as search_pct,
    SUM(CASE WHEN channel = 'display' THEN revenue ELSE 0 END) * 100.0 / NULLIF(SUM(revenue), 0) as display_pct
FROM marketing_events
WHERE timestamp >= '2024-01-01'
GROUP BY DATE_TRUNC('month', timestamp)
ORDER BY month;
"""

# Example 2: ROLLUP for hierarchical aggregations
hierarchical_aggregation = """
SELECT 
    channel,
    campaign_id,
    event_type,
    COUNT(*) as event_count,
    SUM(revenue) as total_revenue,
    GROUPING(channel) as is_channel_total,
    GROUPING(campaign_id) as is_campaign_total,
    GROUPING(event_type) as is_event_type_total
FROM marketing_events
WHERE date >= '2024-01-01'
GROUP BY ROLLUP(channel, campaign_id, event_type)
ORDER BY channel, campaign_id, event_type;
"""

# Example 3: CUBE for multi-dimensional analysis
cube_analysis = """
SELECT 
    channel,
    DATE_TRUNC('month', timestamp) as month,
    event_type,
    COUNT(*) as events,
    SUM(revenue) as revenue,
    COUNT(DISTINCT user_id) as unique_users
FROM marketing_events
WHERE timestamp >= '2024-01-01'
GROUP BY CUBE(channel, DATE_TRUNC('month', timestamp), event_type)
ORDER BY channel, month, event_type;
"""

# Example 4: Percentile aggregations
percentile_analysis = """
SELECT 
    channel,
    COUNT(*) as total_events,
    AVG(revenue) as avg_revenue,
    MEDIAN(revenue) as median_revenue,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue) as p25_revenue,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) as p75_revenue,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY revenue) as p90_revenue,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY revenue) as p95_revenue,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY revenue) as p99_revenue
FROM marketing_events
WHERE event_type = 'conversion'
  AND revenue > 0
GROUP BY channel;
"""

# Example 5: Moving aggregations
moving_aggregations = """
WITH daily_metrics AS (
    SELECT 
        DATE_TRUNC('day', timestamp) as date,
        channel,
        SUM(revenue) as daily_revenue,
        COUNT(*) as daily_events
    FROM marketing_events
    GROUP BY DATE_TRUNC('day', timestamp), channel
)
SELECT 
    date,
    channel,
    daily_revenue,
    -- Moving aggregations
    SUM(daily_revenue) OVER (PARTITION BY channel ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as revenue_7d,
    AVG(daily_revenue) OVER (PARTITION BY channel ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as avg_revenue_7d,
    MIN(daily_revenue) OVER (PARTITION BY channel ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as min_revenue_7d,
    MAX(daily_revenue) OVER (PARTITION BY channel ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as max_revenue_7d,
    STDDEV(daily_revenue) OVER (PARTITION BY channel ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as stddev_revenue_7d,
    -- Cumulative aggregations
    SUM(daily_revenue) OVER (PARTITION BY channel ORDER BY date ROWS UNBOUNDED PRECEDING) as cumulative_revenue,
    AVG(daily_revenue) OVER (PARTITION BY channel ORDER BY date ROWS UNBOUNDED PRECEDING) as cumulative_avg_revenue
FROM daily_metrics
ORDER BY channel, date;
"""

print("✓ Advanced aggregation queries defined")
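
`PERCENTILE_CONT` interpolates linearly between the two nearest ordered values rather than returning an actual row value. The sketch below reproduces that calculation in plain Python so results from `percentile_analysis` can be spot-checked on a small sample; the function name is hypothetical, and `None` values are skipped on the assumption that they stand in for SQL NULLs.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbors."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(v for v in values if v is not None)  # NULLs are ignored
    if not ordered:
        return None

    # Zero-based fractional position of the requested percentile
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])

    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


revenues = [10.0, 20.0, 30.0, 40.0]  # hypothetical conversion revenues
for p in (0.25, 0.5, 0.75, 0.9):
    print(p, percentile_cont(revenues, p))
```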

## 7. Data Quality and Validation at Scale

In [None]:
# Example 1: Comprehensive data quality checks
data_quality_checks = """
WITH quality_checks AS (
    SELECT 
        'Row Count' as check_type,
        COUNT(*) as metric_value,
        CAST(NULL AS DECIMAL(10,2)) as threshold,
        'INFO' as severity
    FROM marketing_events
    WHERE date = CURRENT_DATE - 1
    
    UNION ALL
    
    SELECT 
        'Null user_id' as check_type,
        COUNT(*) as metric_value,
        0 as threshold,
        'ERROR' as severity
    FROM marketing_events
    WHERE date = CURRENT_DATE - 1
      AND user_id IS NULL
    
    UNION ALL
    
    SELECT 
        'Negative Revenue' as check_type,
        COUNT(*) as metric_value,
        0 as threshold,
        'ERROR' as severity
    FROM marketing_events
    WHERE date = CURRENT_DATE - 1
      AND revenue < 0
    
    UNION ALL
    
    SELECT 
        'Future Timestamps' as check_type,
        COUNT(*) as metric_value,
        0 as threshold,
        'ERROR' as severity
    FROM marketing_events
    WHERE date = CURRENT_DATE - 1
      AND timestamp > GETDATE()
    
    UNION ALL
    
    SELECT 
        'Duplicate event_id' as check_type,
        COUNT(*) as metric_value,
        0 as threshold,
        'ERROR' as severity
    FROM (
        SELECT event_id, COUNT(*) as cnt
        FROM marketing_events
        WHERE date = CURRENT_DATE - 1
        GROUP BY event_id
        HAVING COUNT(*) > 1
    ) AS duplicate_events
    
    UNION ALL
    
    SELECT 
        'Conversion without click' as check_type,
        COUNT(DISTINCT c.user_id) as metric_value,
        CAST(NULL AS DECIMAL) as threshold,
        'WARNING' as severity
    FROM marketing_events c
    WHERE c.date = CURRENT_DATE - 1
      AND c.event_type = 'conversion'
      AND NOT EXISTS (
          SELECT 1 FROM marketing_events e
          WHERE e.user_id = c.user_id
            AND e.event_type = 'click'
            AND e.timestamp < c.timestamp
            AND e.timestamp >= c.timestamp - INTERVAL '30 days'
      )
)
SELECT 
    check_type,
    metric_value,
    threshold,
    severity,
    CASE 
        WHEN threshold IS NULL THEN 'N/A'
        WHEN metric_value <= threshold THEN 'PASS'
        ELSE 'FAIL'
    END as status
FROM quality_checks
ORDER BY 
    CASE severity 
        WHEN 'ERROR' THEN 1 
        WHEN 'WARNING' THEN 2 
        ELSE 3 
    END,
    check_type;
"""

# Example 2: Statistical outlier detection
outlier_detection = """
WITH revenue_stats AS (
    SELECT 
        user_id,
        SUM(revenue) as total_revenue,
        COUNT(*) as transaction_count,
        AVG(revenue) as avg_transaction_value
    FROM marketing_events
    WHERE event_type = 'conversion'
      AND date >= CURRENT_DATE - 30
    GROUP BY user_id
),
distribution_stats AS (
    SELECT 
        AVG(total_revenue) as mean_revenue,
        STDDEV(total_revenue) as stddev_revenue,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_revenue) as q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_revenue) as q3
    FROM revenue_stats
)
SELECT 
    r.user_id,
    r.total_revenue,
    r.transaction_count,
    r.avg_transaction_value,
    d.mean_revenue,
    d.stddev_revenue,
    -- Z-score method
    (r.total_revenue - d.mean_revenue) / NULLIF(d.stddev_revenue, 0) as z_score,
    -- IQR method
    d.q3 - d.q1 as iqr,
    d.q1 - 1.5 * (d.q3 - d.q1) as lower_bound,
    d.q3 + 1.5 * (d.q3 - d.q1) as upper_bound,
    CASE 
        WHEN ABS((r.total_revenue - d.mean_revenue) / NULLIF(d.stddev_revenue, 0)) > 3 THEN 'Z-Score Outlier'
        WHEN r.total_revenue < d.q1 - 1.5 * (d.q3 - d.q1) THEN 'IQR Low Outlier'
        WHEN r.total_revenue > d.q3 + 1.5 * (d.q3 - d.q1) THEN 'IQR High Outlier'
        ELSE 'Normal'
    END as outlier_type
FROM revenue_stats r
CROSS JOIN distribution_stats d
WHERE ABS((r.total_revenue - d.mean_revenue) / NULLIF(d.stddev_revenue, 0)) > 3
   OR r.total_revenue < d.q1 - 1.5 * (d.q3 - d.q1)
   OR r.total_revenue > d.q3 + 1.5 * (d.q3 - d.q1)
ORDER BY ABS((r.total_revenue - d.mean_revenue) / NULLIF(d.stddev_revenue, 0)) DESC;
"""

# Example 3: Data completeness monitoring
data_completeness = """
SELECT 
    date,
    COUNT(*) as total_rows,
    COUNT(user_id) as user_id_count,
    COUNT(campaign_id) as campaign_id_count,
    COUNT(channel) as channel_count,
    COUNT(revenue) as revenue_count,
    -- Completeness percentages
    COUNT(user_id) * 100.0 / COUNT(*) as user_id_completeness,
    COUNT(campaign_id) * 100.0 / COUNT(*) as campaign_id_completeness,
    COUNT(channel) * 100.0 / COUNT(*) as channel_completeness,
    COUNT(revenue) * 100.0 / COUNT(*) as revenue_completeness,
    -- Day-over-day change
    COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY date) as row_change,
    (COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY date)) * 100.0 / 
        NULLIF(LAG(COUNT(*)) OVER (ORDER BY date), 0) as row_change_pct
FROM marketing_events
WHERE date >= CURRENT_DATE - 30
GROUP BY date
ORDER BY date DESC;
"""

print("✓ Data quality queries defined")
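
The `data_quality_checks` query returns one row per check with a `severity` and a `status`. A pipeline typically needs to act on that result, for example by stopping a load when any ERROR-level check fails. The following is a minimal sketch of such a gate over plain dictionaries; `DataQualityError`, `failed_checks`, and `enforce_quality_gate` are hypothetical names, and the sample rows are made up.

```python
class DataQualityError(Exception):
    """Raised when a blocking data quality check fails."""


def failed_checks(rows, severities=('ERROR',)):
    """Return the rows whose status is FAIL at one of the given severities."""
    return [
        row for row in rows
        if row['status'] == 'FAIL' and row['severity'] in severities
    ]


def enforce_quality_gate(rows):
    """Raise DataQualityError if any ERROR-level check failed; otherwise return True."""
    failures = failed_checks(rows)
    if failures:
        summary = ', '.join(
            f"{row['check_type']}={row['metric_value']}" for row in failures
        )
        raise DataQualityError(f"Blocking data quality failures: {summary}")
    return True


# Hypothetical result rows shaped like the query output
sample_rows = [
    {'check_type': 'Row Count', 'metric_value': 120000, 'severity': 'INFO', 'status': 'N/A'},
    {'check_type': 'Null user_id', 'metric_value': 0, 'severity': 'ERROR', 'status': 'PASS'},
    {'check_type': 'Negative Revenue', 'metric_value': 3, 'severity': 'ERROR', 'status': 'FAIL'},
]
try:
    enforce_quality_gate(sample_rows)
except DataQualityError as exc:
    print(exc)
```

With the helper from section 1, the rows could come from `execute_query(data_quality_checks).to_dict('records')`.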

## 8. Real-World Project: Multi-Touch Attribution Optimization

### Complete attribution system implementation

In [None]:
# Step 1: Build user journey table
create_user_journeys = """
CREATE TABLE user_journeys
DISTKEY(user_id)
SORTKEY(user_id, event_sequence)
AS
WITH ordered_events AS (
    SELECT 
        user_id,
        event_id,
        timestamp,
        channel,
        campaign_id,
        event_type,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp) as event_sequence,
        LEAD(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp) as next_event_time,
        LEAD(event_type) OVER (PARTITION BY user_id ORDER BY timestamp) as next_event_type
    FROM marketing_events
    WHERE date >= '2024-01-01'
),
conversions AS (
    -- One row per converting user: latest conversion time, largest conversion revenue
    SELECT 
        user_id,
        MAX(CASE WHEN event_type = 'conversion' THEN timestamp END) as conversion_time,
        MAX(CASE WHEN event_type = 'conversion' THEN revenue END) as conversion_value
    FROM ordered_events
    GROUP BY user_id
    HAVING MAX(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END) = 1
)
-- Note: the conversion event itself is kept as a row, so downstream models
-- that count touchpoints per user include it.
SELECT 
    e.user_id,
    e.event_id,
    e.timestamp,
    e.channel,
    e.campaign_id,
    e.event_type,
    e.event_sequence,
    c.conversion_time,
    c.conversion_value,
    DATEDIFF(second, e.timestamp, c.conversion_time) as seconds_to_conversion,
    CASE WHEN e.event_sequence = 1 THEN 1 ELSE 0 END as is_first_touch,
    CASE WHEN e.next_event_type = 'conversion' THEN 1 ELSE 0 END as is_last_touch
FROM ordered_events e
INNER JOIN conversions c ON e.user_id = c.user_id
WHERE e.timestamp <= c.conversion_time;
"""

# Step 2: First-touch attribution
first_touch_attribution = """
SELECT 
    channel,
    campaign_id,
    COUNT(DISTINCT user_id) as attributed_conversions,
    SUM(conversion_value) as attributed_revenue
FROM user_journeys
WHERE is_first_touch = 1
GROUP BY channel, campaign_id
ORDER BY attributed_revenue DESC;
"""

# Step 3: Last-touch attribution
last_touch_attribution = """
SELECT 
    channel,
    campaign_id,
    COUNT(DISTINCT user_id) as attributed_conversions,
    SUM(conversion_value) as attributed_revenue
FROM user_journeys
WHERE is_last_touch = 1
GROUP BY channel, campaign_id
ORDER BY attributed_revenue DESC;
"""

# Step 4: Linear attribution
linear_attribution = """
WITH touchpoint_counts AS (
    SELECT 
        user_id,
        COUNT(*) as total_touchpoints
    FROM user_journeys
    GROUP BY user_id
)
SELECT 
    j.channel,
    j.campaign_id,
    COUNT(*) as total_touchpoints,
    SUM(j.conversion_value / t.total_touchpoints) as attributed_revenue
FROM user_journeys j
INNER JOIN touchpoint_counts t ON j.user_id = t.user_id
GROUP BY j.channel, j.campaign_id
ORDER BY attributed_revenue DESC;
"""

# Step 5: Time-decay attribution
time_decay_attribution = """
WITH weighted_touchpoints AS (
    SELECT 
        user_id,
        channel,
        campaign_id,
        conversion_value,
        seconds_to_conversion,
        -- Exponential decay: weight = e^(-lambda * time)
        -- Using 7-day half-life: lambda = ln(2) / (7 * 86400), about 1.146e-6 per second
        EXP(-LN(2.0) / (7 * 86400.0) * seconds_to_conversion) as time_weight
    FROM user_journeys
),
normalized_weights AS (
    SELECT 
        user_id,
        channel,
        campaign_id,
        conversion_value,
        time_weight,
        SUM(time_weight) OVER (PARTITION BY user_id) as total_weight
    FROM weighted_touchpoints
)
SELECT 
    channel,
    campaign_id,
    COUNT(DISTINCT user_id) as users_influenced,
    SUM(conversion_value * time_weight / total_weight) as attributed_revenue
FROM normalized_weights
GROUP BY channel, campaign_id
ORDER BY attributed_revenue DESC;
"""

# Step 6: Position-based (U-shaped) attribution
position_based_attribution = """
WITH journey_lengths AS (
    SELECT 
        user_id,
        channel,
        campaign_id,
        conversion_value,
        event_sequence,
        MAX(event_sequence) OVER (PARTITION BY user_id) as total_touchpoints
    FROM user_journeys
),
journey_positions AS (
    SELECT 
        user_id,
        channel,
        campaign_id,
        conversion_value,
        event_sequence,
        total_touchpoints,
        CASE 
            -- Short journeys: keep weights summing to 1 per user
            WHEN total_touchpoints = 1 THEN 1.0
            WHEN total_touchpoints = 2 THEN 0.5
            WHEN event_sequence = 1 THEN 0.4  -- First touch: 40%
            WHEN event_sequence = total_touchpoints THEN 0.4  -- Last touch: 40%
            ELSE 0.2 / (total_touchpoints - 2)  -- Middle touches: 20% divided
        END as position_weight
    FROM journey_lengths
)
SELECT 
    channel,
    campaign_id,
    COUNT(DISTINCT user_id) as users_influenced,
    SUM(conversion_value * position_weight) as attributed_revenue
FROM journey_positions
GROUP BY channel, campaign_id
ORDER BY attributed_revenue DESC;
"""

# Step 7: Compare all attribution models
compare_attribution_models = """
WITH first_touch AS (
    SELECT channel, SUM(conversion_value) as revenue
    FROM user_journeys
    WHERE is_first_touch = 1
    GROUP BY channel
),
last_touch AS (
    SELECT channel, SUM(conversion_value) as revenue
    FROM user_journeys
    WHERE is_last_touch = 1
    GROUP BY channel
),
linear AS (
    SELECT 
        j.channel,
        SUM(j.conversion_value / t.total_touchpoints) as revenue
    FROM user_journeys j
    INNER JOIN (
        SELECT user_id, COUNT(*) as total_touchpoints
        FROM user_journeys
        GROUP BY user_id
    ) t ON j.user_id = t.user_id
    GROUP BY j.channel
)
SELECT 
    COALESCE(ft.channel, lt.channel, l.channel) as channel,
    ft.revenue as first_touch_revenue,
    lt.revenue as last_touch_revenue,
    l.revenue as linear_revenue,
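    -- Spread across models (aggregates such as STDDEV_SAMP work over rows, not over a list of columns)
    GREATEST(COALESCE(ft.revenue, 0), COALESCE(lt.revenue, 0), COALESCE(l.revenue, 0))
      - LEAST(COALESCE(ft.revenue, 0), COALESCE(lt.revenue, 0), COALESCE(l.revenue, 0)) as model_spread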
FROM first_touch ft
FULL OUTER JOIN last_touch lt ON ft.channel = lt.channel
FULL OUTER JOIN linear l ON COALESCE(ft.channel, lt.channel) = l.channel
ORDER BY first_touch_revenue DESC;
"""

print("""
✓ Multi-Touch Attribution System Created

Attribution Models Implemented:
1. First-Touch: All credit to first interaction
2. Last-Touch: All credit to last interaction
3. Linear: Equal credit to all touchpoints
4. Time-Decay: More credit to recent interactions
5. Position-Based: 40% first, 40% last, 20% middle

Next Steps:
- Create materialized views for each model
- Build reporting dashboard
- Schedule daily refreshes
- Compare model predictions vs actual results
""")

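The model comparison can be sanity-checked on a handful of journeys before running it against Redshift. The sketch below is a pure-Python illustration, not the notebook's implementation: the `journeys` list and the `attribute` function are hypothetical, and each journey is assumed to end in exactly one conversion. It computes first-touch, last-touch, and linear credit per channel, then the sample standard deviation across the three models.

```python
from collections import defaultdict
from statistics import stdev

# Hypothetical toy journeys: (user_id, channels in order, conversion_value)
journeys = [
    ("u1", ["search", "email", "social"], 90.0),
    ("u2", ["email", "search"], 60.0),
    ("u3", ["social"], 30.0),
]

def attribute(journeys):
    """Return {channel: {model: revenue}} for first-touch, last-touch and linear."""
    out = defaultdict(lambda: {"first_touch": 0.0, "last_touch": 0.0, "linear": 0.0})
    for _user, channels, value in journeys:
        out[channels[0]]["first_touch"] += value   # all credit to the first touch
        out[channels[-1]]["last_touch"] += value   # all credit to the last touch
        share = value / len(channels)              # equal credit to every touch
        for channel in channels:
            out[channel]["linear"] += share
    return dict(out)

results = attribute(journeys)
for channel, r in sorted(results.items()):
    spread = stdev(r.values())  # disagreement between the three models
    print(f"{channel:8s} first={r['first_touch']:.0f} "
          f"last={r['last_touch']:.0f} linear={r['linear']:.0f} stddev={spread:.2f}")
```

Every model should distribute the same total revenue, so comparing the per-model sums is a quick check that no credit was dropped or double counted.
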
## 9. Query Performance Optimization

In [None]:
# Performance optimization checklist
optimization_techniques = """
-- 1. Use CTEs to break complex queries into steps
-- 2. Filter early (WHERE clauses before joins)
-- 3. Use appropriate distribution keys
-- 4. Add sort keys for common filters
-- 5. Avoid SELECT * - specify columns
-- 6. Use EXPLAIN to analyze query plans
-- 7. Materialize intermediate results for complex queries
-- 8. Use UNLOAD for large exports
-- 9. Sort large tables by date (native Redshift tables have no partitions; partition Spectrum external tables instead)
-- 10. Run VACUUM and ANALYZE regularly
"""

# Example: Optimize slow attribution query
optimized_attribution = """
-- BEFORE: Slow query with multiple scans
-- SELECT ... multiple window functions on full table

-- AFTER: Optimized with CTEs and filtering
WITH recent_events AS (
    -- Filter early
    SELECT user_id, channel, timestamp, revenue
    FROM marketing_events
    WHERE date >= '2024-01-01'
      AND date < '2024-02-01'
      AND event_type IN ('click', 'conversion')
),
user_conversions AS (
    -- Identify converters only
    SELECT user_id, MAX(timestamp) as conversion_time, MAX(revenue) as revenue
    FROM recent_events
    WHERE revenue > 0
    GROUP BY user_id
),
attribution_data AS (
    -- Join only necessary data
    SELECT 
        e.channel,
        c.revenue,
        ROW_NUMBER() OVER (PARTITION BY e.user_id ORDER BY e.timestamp DESC) as rn
    FROM recent_events e
    -- Strictly before the conversion, so the conversion event itself is never picked as the last touch
    INNER JOIN user_conversions c ON e.user_id = c.user_id AND e.timestamp < c.conversion_time
)
SELECT 
    channel,
    SUM(revenue) as attributed_revenue
FROM attribution_data
WHERE rn = 1  -- Last touch
GROUP BY channel;
"""

print("✓ Optimization techniques documented")
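
Before running a rewritten query on the cluster, its logic can be checked locally on a few rows. The sketch below loads hypothetical events into an in-memory SQLite database and runs a last-touch query with the same CTE structure. SQLite is only a stand-in here: its dialect differs from Redshift, the table contents are invented, timestamps are stored as ISO text, and touches are assumed to occur strictly before the conversion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE marketing_events (
        user_id TEXT, channel TEXT, event_type TEXT,
        date TEXT, timestamp TEXT, revenue REAL
    )
""")
# Hypothetical rows: u1 and u2 convert, u3 never does.
conn.executemany(
    "INSERT INTO marketing_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("u1", "search", "click", "2024-01-05", "2024-01-05 10:00:00", 0.0),
        ("u1", "email", "click", "2024-01-06", "2024-01-06 09:00:00", 0.0),
        ("u1", "direct", "conversion", "2024-01-06", "2024-01-06 09:30:00", 100.0),
        ("u2", "social", "click", "2024-01-10", "2024-01-10 12:00:00", 0.0),
        ("u2", "direct", "conversion", "2024-01-11", "2024-01-11 08:00:00", 50.0),
        ("u3", "search", "click", "2024-01-15", "2024-01-15 14:00:00", 0.0),
    ],
)

last_touch_sql = """
WITH recent_events AS (
    SELECT user_id, channel, timestamp, revenue
    FROM marketing_events
    WHERE date >= '2024-01-01' AND date < '2024-02-01'
      AND event_type IN ('click', 'conversion')
),
user_conversions AS (
    SELECT user_id, MAX(timestamp) AS conversion_time, MAX(revenue) AS revenue
    FROM recent_events
    WHERE revenue > 0
    GROUP BY user_id
),
attribution_data AS (
    SELECT e.channel, c.revenue,
           ROW_NUMBER() OVER (PARTITION BY e.user_id ORDER BY e.timestamp DESC) AS rn
    FROM recent_events e
    INNER JOIN user_conversions c
        ON e.user_id = c.user_id AND e.timestamp < c.conversion_time
)
SELECT channel, SUM(revenue) AS attributed_revenue
FROM attribution_data
WHERE rn = 1
GROUP BY channel
ORDER BY channel
"""

rows = conn.execute(last_touch_sql).fetchall()
for channel, revenue in rows:
    print(channel, revenue)
conn.close()
```

The non-converting user drops out at the join, and each converter's revenue lands on the last click before the conversion.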

## 10. Best Practices Summary

### Advanced SQL Techniques
1. Use Python UDFs for logic that is awkward or impossible to express in SQL
2. Leverage recursive CTEs for hierarchical data
3. Apply window functions for sequential analysis
4. Implement multiple attribution models for comparison
5. Use time-series functions for trend analysis

### Performance
1. Run EXPLAIN on complex queries before executing them
2. Filter early and often
3. Use appropriate distribution and sort keys
4. Materialize frequently-used intermediate results
5. Monitor query performance regularly
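
As one way to make the EXPLAIN habit routine, the sketch below wraps a query in `EXPLAIN` and scans plan text for a few labels that Redshift uses for expensive data movement. The helper names (`explain_sql`, `flag_plan_issues`) and the marker list are assumptions for illustration, and the sample plan lines are hand-written rather than real output.

```python
from typing import Iterable, List

# Plan markers that often signal expensive data movement or joins in Redshift.
WARNING_MARKERS = {
    "DS_BCAST_INNER": "inner table is broadcast to every node",
    "DS_DIST_BOTH": "both tables are redistributed for the join",
    "Nested Loop": "nested loop join, often a missing join condition",
}

def explain_sql(query: str) -> str:
    """Return the query prefixed with EXPLAIN, without a trailing semicolon."""
    return "EXPLAIN " + query.strip().rstrip(";")

def flag_plan_issues(plan_lines: Iterable[str]) -> List[str]:
    """Return one warning for each known marker found in each plan line."""
    warnings = []
    for line in plan_lines:
        for marker, meaning in WARNING_MARKERS.items():
            if marker in line:
                warnings.append(f"{marker}: {meaning}")
    return warnings

# Hand-written plan lines for illustration only (not real EXPLAIN output).
sample_plan = [
    "XN Hash Join DS_BCAST_INNER",
    "XN Seq Scan on marketing_events",
]
print(explain_sql("SELECT channel FROM marketing_events;"))
print(flag_plan_issues(sample_plan))
```

In this notebook, the string from `explain_sql` could be passed to the `execute_query` helper from the setup section, and the text column of the returned DataFrame fed to `flag_plan_issues`.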

### Data Quality
1. Implement comprehensive data quality checks
2. Monitor data completeness and consistency
3. Detect and handle outliers appropriately
4. Validate business logic constraints
5. Track data quality metrics over time
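
Two of these checks, completeness and outlier detection, can be prototyped on a small extract before being written as SQL. The sketch below is a minimal pure-Python illustration. The function names, the sample rows, and the conventional 1.5 × IQR cutoff are all assumptions, not part of the course's data quality code.

```python
from statistics import quantiles

def null_rate(rows, column):
    """Share of rows where `column` is missing (None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical extract: ten events, one missing channel, one extreme revenue.
revenues = [10, 12, 11, 13, 12, 11, 10, 13, 12, 500]
events = [{"channel": "email", "revenue": v} for v in revenues]
events[3]["channel"] = None

print(f"channel null rate: {null_rate(events, 'channel'):.0%}")
print(f"revenue outliers: {iqr_outliers(revenues)}")
```

Logging both numbers per load date gives a simple time series for tracking quality trends.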

## 11. Exercises

### Exercise 1: Attribution Models
Implement and compare:
1. First-touch attribution
2. Last-touch attribution
3. Linear attribution
4. Time-decay attribution
5. Custom weighted attribution

Analyze which model best fits your data.
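
A possible starting point for the custom weighted model is a position-based weight function. The sketch below assumes the 40% / 40% / 20% split listed for the position-based model. It also makes one choice the SQL may not: journeys with one or two touches are renormalized so the weights still sum to 1. The function name and that renormalization rule are assumptions.

```python
def position_weights(n_touches, first=0.4, last=0.4):
    """Return one weight per touchpoint, in journey order, summing to 1."""
    if n_touches < 1:
        raise ValueError("a journey needs at least one touchpoint")
    if n_touches == 1:
        return [1.0]  # a single touch takes all the credit
    if n_touches == 2:
        total = first + last  # no middle touches: rescale first and last
        return [first / total, last / total]
    middle = (1.0 - first - last) / (n_touches - 2)  # split the remainder evenly
    return [first] + [middle] * (n_touches - 2) + [last]

for n in range(1, 6):
    weights = position_weights(n)
    print(n, [round(w, 3) for w in weights], round(sum(weights), 6))
```

Checking that the weights sum to 1 for every journey length is a useful test for any custom scheme, since weights that fall short silently drop revenue from the totals.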

### Exercise 2: User Journey Analysis
1. Build complete user journey table
2. Identify common conversion paths
3. Calculate path metrics (length, value, time)
4. Find high-performing channel sequences
5. Optimize marketing mix based on insights
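
Steps 1 to 3 can be prototyped in plain Python before being translated to SQL. In the sketch below, the event tuples, the `" > "` path separator, and the helper names are hypothetical, and a user counts as a converter if their total revenue is positive.

```python
from collections import Counter, defaultdict

# Hypothetical events: (user_id, timestamp, channel, revenue), deliberately unordered.
events = [
    ("u1", 2, "email", 80.0), ("u1", 1, "search", 0.0),
    ("u2", 1, "search", 0.0), ("u2", 2, "email", 40.0),
    ("u3", 1, "social", 30.0),
    ("u4", 1, "search", 0.0),  # never converts
]

def build_paths(events):
    """Return {user_id: (path, length, value)} for converting users only."""
    by_user = defaultdict(list)
    for user_id, _ts, channel, revenue in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user_id].append((channel, revenue))
    paths = {}
    for user_id, touches in by_user.items():
        value = sum(revenue for _channel, revenue in touches)
        if value > 0:
            path = " > ".join(channel for channel, _revenue in touches)
            paths[user_id] = (path, len(touches), value)
    return paths

def summarize(paths):
    """Return (path, conversions, average value) sorted by conversions, descending."""
    counts = Counter(path for path, _length, _value in paths.values())
    totals = defaultdict(float)
    for path, _length, value in paths.values():
        totals[path] += value
    return [(path, n, totals[path] / n) for path, n in counts.most_common()]

for path, conversions, avg_value in summarize(build_paths(events)):
    print(f"{path:20s} conversions={conversions} avg_value={avg_value:.2f}")
```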

### Exercise 3: Time-Series Analysis
1. Decompose revenue into trend and seasonality
2. Detect anomalies in daily metrics
3. Forecast revenue for the next 30 days
4. Identify change points
5. Create alerting rules
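
For step 2, one simple baseline is a trailing-window z-score: compare each day with the mean and standard deviation of the preceding days. The sketch below is illustrative only. The 7-day window, the threshold of 3, the function name, and the revenue series are all assumptions to adjust for real data.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=7, threshold=3.0):
    """Flag points whose z-score against the trailing window exceeds the threshold."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]  # the `window` points before index i
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: z-score undefined, skip the point
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, series[i], round(z, 2)))
    return anomalies

# Hypothetical daily revenue with one spike at index 8.
daily_revenue = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101, 99]

for index, value, z in rolling_zscore_anomalies(daily_revenue):
    print(f"day {index}: value={value}, z={z}")
```

A spike inflates the mean and standard deviation of the windows that follow it, so later anomalies can be masked. Excluding flagged points from the history, or using a median-based measure, are common refinements.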

### Exercise 4: Data Quality Framework
Build a complete data quality framework:
1. Define quality metrics
2. Implement automated checks
3. Create quality dashboards
4. Set up alerting
5. Track quality trends

## Resources

### Documentation
- [Redshift UDFs](https://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html)
- [Window Functions](https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html)
- [Recursive CTEs](https://docs.aws.amazon.com/redshift/latest/dg/r_WITH_clause.html)
- [Attribution Modeling Guide](https://support.google.com/analytics/answer/1662518)

### Papers & Articles
- Multi-Touch Attribution: Theory and Practice
- Markov Chain Attribution Models
- Shapley Value for Marketing Attribution
- Time-Series Analysis for Marketing

### Tools
- Redshift Query Editor: browser-based SQL client in the AWS console
- Mode Analytics: SQL + visualization
- Looker: BI platform
- dbt: Data transformation tool