# Week 9: Advanced Attribution at Scale

## Enterprise-Scale Multi-Touch Attribution (100M+ Touchpoints)

### Learning Objectives
- Implement attribution models on 100M+ touchpoints
- Use efficient graph algorithms for customer journey analysis
- Apply Markov chains with sparse matrices
- Approximate Shapley values for large datasets
- Build distributed attribution computation systems
- Create attribution dashboards with Redshift backend

### Prerequisites
- AWS Redshift cluster access
- Python 3.8+
- Understanding of attribution modeling concepts
- Familiarity with distributed computing

## 1. Environment Setup and Dependencies

In [None]:
# Install required packages
!pip install psycopg2-binary pandas numpy scipy scikit-learn \
    networkx matplotlib seaborn plotly dash dask[complete] \
    sqlalchemy pyarrow fastparquet joblib tqdm

In [None]:
import psycopg2
import pandas as pd
import numpy as np
from scipy import sparse
from scipy.sparse import csr_matrix, linalg as sparse_linalg
import networkx as nx
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from sqlalchemy import create_engine
import dask.dataframe as dd
from dask.distributed import Client
from joblib import Parallel, delayed
from itertools import combinations
from collections import defaultdict, Counter
from datetime import datetime, timedelta
import warnings
from tqdm import tqdm
import json

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

print("✓ All packages imported successfully")

## 2. Redshift Connection and Data Architecture

In [None]:
class RedshiftConnection:
    """
    Production-ready Redshift connection manager with connection pooling
    and error handling.
    """
    
    def __init__(self, host, port, database, user, password):
        self.host = host
        self.port = port
        self.database = database
        self.user = user
        self.password = password
        self.engine = None
        self.conn = None
        
    def connect(self):
        """Establish connection to Redshift"""
        try:
            # SQLAlchemy engine for pandas integration
            conn_string = f"postgresql+psycopg2://{self.user}:{self.password}@{self.host}:{self.port}/{self.database}"
            self.engine = create_engine(conn_string, pool_size=10, max_overflow=20)
            
            # Direct psycopg2 connection for raw queries
            self.conn = psycopg2.connect(
                host=self.host,
                port=self.port,
                database=self.database,
                user=self.user,
                password=self.password
            )
            print(f"✓ Connected to Redshift: {self.database}")
            return True
        except Exception as e:
            print(f"✗ Connection failed: {e}")
            return False
    
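    def execute_query(self, query, return_df=True):
        """Execute query and optionally return DataFrame"""
        if return_df:
            return pd.read_sql(query, self.engine)

        # DDL/DML path: commit on success, roll back on failure so the
        # connection is not left in an aborted transaction
        try:
            with self.conn.cursor() as cursor:
                cursor.execute(query)
            self.conn.commit()
        except Exception:
            self.conn.rollback()
            raise
        return None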
    
    def close(self):
        """Close connections"""
        if self.conn:
            self.conn.close()
        if self.engine:
            self.engine.dispose()
        print("✓ Connections closed")

# Initialize connection (replace with your credentials)
rs = RedshiftConnection(
    host='your-cluster.region.redshift.amazonaws.com',
    port=5439,
    database='marketing_db',
    user='your_user',
    password='your_password'
)

# Uncomment to connect
# rs.connect()
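
The placeholder credentials above are best kept out of the notebook itself. The following is a minimal sketch, assuming the values live in environment variables; the `REDSHIFT_*` variable names and the helper `redshift_config_from_env` are hypothetical and not part of the original notebook.

```python
import os


def redshift_config_from_env(env=None, prefix="REDSHIFT_"):
    """Build RedshiftConnection keyword arguments from environment variables.

    The variable names (REDSHIFT_HOST, REDSHIFT_PORT, ...) are an assumed
    convention for this sketch, not something Redshift itself requires.
    """
    env = os.environ if env is None else env
    required = ["HOST", "DATABASE", "USER", "PASSWORD"]
    missing = [prefix + key for key in required if not env.get(prefix + key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env[prefix + "HOST"],
        "port": int(env.get(prefix + "PORT", 5439)),  # 5439 is the port used above
        "database": env[prefix + "DATABASE"],
        "user": env[prefix + "USER"],
        "password": env[prefix + "PASSWORD"],
    }


# Demo with an explicit mapping instead of the real environment
demo_env = {
    "REDSHIFT_HOST": "example-cluster.local",
    "REDSHIFT_DATABASE": "marketing_db",
    "REDSHIFT_USER": "analyst",
    "REDSHIFT_PASSWORD": "s3cret",
}
demo_config = redshift_config_from_env(demo_env)
print({k: v for k, v in demo_config.items() if k != "password"})
```

With real environment variables set, `RedshiftConnection(**redshift_config_from_env())` would replace the hard-coded arguments.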

## 3. Schema Design for Attribution at Scale

### Optimized Redshift Tables for 100M+ Touchpoints

In [None]:
# Create optimized Redshift tables for attribution
create_tables_sql = """
-- Touchpoint events table (100M+ rows)
-- Distribution key: user_id for efficient joins
-- Sort key: timestamp for time-series queries
CREATE TABLE IF NOT EXISTS touchpoints (
    touchpoint_id BIGINT IDENTITY(1,1),
    user_id VARCHAR(64) NOT NULL,
    session_id VARCHAR(64),
    timestamp TIMESTAMP NOT NULL,
    channel VARCHAR(50) NOT NULL,
    campaign VARCHAR(100),
    device VARCHAR(20),
    revenue DECIMAL(10,2) DEFAULT 0,
    converted BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (touchpoint_id)
)
DISTKEY(user_id)
SORTKEY(timestamp);

-- Pre-aggregated customer journeys
-- Stores journey paths for faster attribution computation
CREATE TABLE IF NOT EXISTS customer_journeys (
    journey_id BIGINT IDENTITY(1,1),
    user_id VARCHAR(64) NOT NULL,
    journey_path VARCHAR(4096),  -- Channel sequence joined with ' > '
    conversion_value DECIMAL(10,2),
    touchpoint_count INTEGER,
    journey_duration_hours INTEGER,
    first_touch_date DATE,
    conversion_date DATE,
    PRIMARY KEY (journey_id)
)
DISTKEY(user_id)
SORTKEY(conversion_date);

-- Attribution results table
CREATE TABLE IF NOT EXISTS attribution_results (
    result_id BIGINT IDENTITY(1,1),
    user_id VARCHAR(64),
    channel VARCHAR(50),
    model_type VARCHAR(50),
    attribution_credit DECIMAL(10,4),
    attributed_revenue DECIMAL(10,2),
    computation_date TIMESTAMP DEFAULT GETDATE(),
    PRIMARY KEY (result_id)
)
DISTKEY(user_id)
SORTKEY(computation_date, channel);

-- Markov chain transition matrices
CREATE TABLE IF NOT EXISTS markov_transitions (
    from_state VARCHAR(50),
    to_state VARCHAR(50),
    transition_count BIGINT,
    transition_probability DECIMAL(10,8),
    model_version VARCHAR(20),
    computation_date DATE,
    PRIMARY KEY (from_state, to_state, model_version)
)
DISTKEY(from_state)
SORTKEY(model_version, computation_date);

-- Channel performance summary (for dashboards)
CREATE TABLE IF NOT EXISTS channel_performance (
    channel VARCHAR(50),
    date DATE,
    total_touchpoints BIGINT,
    conversions INTEGER,
    revenue DECIMAL(12,2),
    first_touch_credit DECIMAL(12,2),
    last_touch_credit DECIMAL(12,2),
    linear_credit DECIMAL(12,2),
    markov_credit DECIMAL(12,2),
    shapley_credit DECIMAL(12,2),
    PRIMARY KEY (channel, date)
)
DISTKEY(channel)
SORTKEY(date);
"""

# Execute table creation
# rs.execute_query(create_tables_sql, return_df=False)
print("✓ Table schemas defined")

## 4. Efficient Data Loading from Redshift

### Chunked Loading for Large Datasets

In [None]:
class AttributionDataLoader:
    """
    Efficient data loader for large-scale attribution analysis.
    Uses chunked loading and Dask for datasets that don't fit in memory.
    """
    
    def __init__(self, rs_connection):
        self.rs = rs_connection
        
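    def load_touchpoints_chunked(self, start_date, end_date, chunk_size=1_000_000):
        """
        Load touchpoints in chunks of `chunk_size` rows.

        The chunks are concatenated into one DataFrame, so the full result
        must still fit in memory; for 100M+ rows, narrow the date range or
        aggregate in Redshift first. Both dates are inclusive.
        """
        query = f"""
        SELECT 
            user_id,
            session_id,
            timestamp,
            channel,
            campaign,
            device,
            revenue,
            converted
        FROM touchpoints
        WHERE timestamp >= '{start_date}'
          AND timestamp < DATEADD(day, 1, '{end_date}'::DATE)
        ORDER BY user_id, timestamp
        """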
        
        chunks = []
        for chunk in pd.read_sql(query, self.rs.engine, chunksize=chunk_size):
            chunks.append(chunk)
            print(f"Loaded chunk: {len(chunk):,} rows")
        
        return pd.concat(chunks, ignore_index=True)
    
    def load_journeys_optimized(self, start_date, end_date):
        """
        Load pre-aggregated journeys for faster processing.
        This query runs entirely in Redshift for efficiency.
        """
        query = f"""
        WITH journey_aggregation AS (
            SELECT 
                user_id,
                LISTAGG(channel, ' > ') WITHIN GROUP (ORDER BY timestamp) as journey_path,
                MAX(CASE WHEN converted THEN revenue ELSE 0 END) as conversion_value,
                COUNT(*) as touchpoint_count,
                DATEDIFF(hour, MIN(timestamp), MAX(timestamp)) as journey_duration_hours,
                DATE(MIN(timestamp)) as first_touch_date,
                DATE(MAX(timestamp)) as conversion_date,
                MAX(converted::INT) as converted
            FROM touchpoints
            WHERE timestamp >= '{start_date}'
              AND timestamp < DATEADD(day, 1, '{end_date}'::DATE)  -- include all of end_date
            GROUP BY user_id
        )
        SELECT *
        FROM journey_aggregation
        WHERE converted = 1  -- Only conversion journeys for attribution
        """
        
        return pd.read_sql(query, self.rs.engine)
    
    def get_channel_summary(self, start_date, end_date):
        """
        Get channel-level summary statistics.
        """
        query = f"""
        SELECT 
            channel,
            COUNT(*) as total_touchpoints,
            COUNT(DISTINCT user_id) as unique_users,
            SUM(CASE WHEN converted THEN 1 ELSE 0 END) as conversions,
            SUM(revenue) as total_revenue
        FROM touchpoints
        WHERE timestamp >= '{start_date}'
          AND timestamp < DATEADD(day, 1, '{end_date}'::DATE)  -- include all of end_date
        GROUP BY channel
        ORDER BY total_revenue DESC
        """
        
        return pd.read_sql(query, self.rs.engine)

# Example usage (uncomment when connected)
# loader = AttributionDataLoader(rs)
# journeys_df = loader.load_journeys_optimized('2024-01-01', '2024-12-31')
print("✓ Data loader configured")
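
Chunked loading only saves memory when each chunk is reduced before the next one arrives. The sketch below illustrates that pattern with a running aggregate; it uses the standard-library `sqlite3` module as a stand-in for Redshift so that it runs anywhere, and the helper `stream_channel_counts` and the demo rows are hypothetical.

```python
import sqlite3
from collections import Counter


def stream_channel_counts(conn, query, chunk_size=2):
    """Count touchpoints per channel while holding only one chunk in memory."""
    counts = Counter()
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)  # at most chunk_size rows per fetch
        if not rows:
            break
        counts.update(channel for (channel,) in rows)
    cursor.close()
    return counts


# sqlite3 stands in for Redshift here (an assumption made for a runnable demo)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE touchpoints (user_id TEXT, channel TEXT)")
demo_conn.executemany(
    "INSERT INTO touchpoints VALUES (?, ?)",
    [
        ("u1", "Email"),
        ("u1", "Direct"),
        ("u2", "Social"),
        ("u2", "Direct"),
        ("u3", "Email"),
    ],
)
demo_counts = stream_channel_counts(demo_conn, "SELECT channel FROM touchpoints")
print(dict(demo_counts))
demo_conn.close()
```

The same idea applies to `pd.read_sql(..., chunksize=...)`: update an aggregate inside the loop instead of appending every chunk to a list.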

## 5. Simulated Large-Scale Dataset

### Generate 10M Customer Journeys for Testing

In [None]:
def generate_large_scale_journeys(n_journeys=10_000_000, save_to_file=True):
    """
    Generate large-scale synthetic journey data for testing.
    Builds the data in batches of 100,000 journeys with NumPy and pandas.
    """
    print(f"Generating {n_journeys:,} customer journeys...")
    
    channels = ['Paid Search', 'Organic Search', 'Social', 'Display', 
                'Email', 'Direct', 'Affiliate', 'Video']
    
    # Journey patterns (realistic probabilities)
    journey_templates = [
        ['Paid Search', 'Direct'],  # 30%
        ['Organic Search', 'Direct'],  # 25%
        ['Social', 'Paid Search', 'Direct'],  # 15%
        ['Display', 'Organic Search', 'Direct'],  # 10%
        ['Email', 'Direct'],  # 10%
        ['Social', 'Email', 'Paid Search', 'Direct'],  # 5%
        ['Display', 'Social', 'Organic Search', 'Direct'],  # 3%
        ['Affiliate', 'Paid Search', 'Direct'],  # 2%
    ]
    
    template_probs = [0.30, 0.25, 0.15, 0.10, 0.10, 0.05, 0.03, 0.02]
    
    def generate_batch(batch_size=100000):
        """Generate a batch of journeys"""
        template_indices = np.random.choice(
            len(journey_templates), 
            size=batch_size, 
            p=template_probs
        )
        
        batch_journeys = []
        for idx in template_indices:
            template = journey_templates[idx].copy()
            # Add some randomness
            if np.random.random() < 0.3:
                # Insert random channel
                insert_pos = np.random.randint(0, len(template))
                template.insert(insert_pos, np.random.choice(channels))
            
            batch_journeys.append(' > '.join(template))
        
        # Generate conversion values (log-normal distribution)
        conversion_values = np.random.lognormal(mean=4.0, sigma=1.0, size=batch_size)
        
        return pd.DataFrame({
            'user_id': [f'user_{i}' for i in range(batch_size)],
            'journey_path': batch_journeys,
            'conversion_value': conversion_values,
            'touchpoint_count': [len(j.split(' > ')) for j in batch_journeys]
        })
    
    # Generate in batches to manage memory
    batch_size = 100000
    n_batches = n_journeys // batch_size
    
    all_batches = []
    for i in tqdm(range(n_batches), desc="Generating batches"):
        batch_df = generate_batch(batch_size)
        all_batches.append(batch_df)
    
    # Handle remainder
    remainder = n_journeys % batch_size
    if remainder > 0:
        batch_df = generate_batch(remainder)
        all_batches.append(batch_df)
    
    journeys_df = pd.concat(all_batches, ignore_index=True)
    journeys_df['user_id'] = [f'user_{i}' for i in range(len(journeys_df))]
    
    if save_to_file:
        # Save as a single Snappy-compressed Parquet file
        journeys_df.to_parquet('large_scale_journeys.parquet', compression='snappy')
        print(f"✓ Saved {len(journeys_df):,} journeys to large_scale_journeys.parquet")
    
    return journeys_df

# Generate sample dataset (reduce size for demo)
print("Generating sample dataset (100K journeys for demo)...")
journeys_df = generate_large_scale_journeys(n_journeys=100000, save_to_file=False)

print(f"\n✓ Dataset shape: {journeys_df.shape}")
print(f"✓ Total conversion value: ${journeys_df['conversion_value'].sum():,.2f}")
print(f"\nSample journeys:")
print(journeys_df.head(10))

## 6. Markov Chain Attribution with Sparse Matrices

### Memory-Efficient Implementation for Large-Scale Data

In [None]:
class SparseMarkovAttribution:
    """
    Efficient Markov Chain attribution using sparse matrices.
    Handles millions of journeys with minimal memory footprint.
    """
    
    def __init__(self):
        self.transition_matrix = None
        self.removal_effects = {}
        self.channels = None
        self.state_to_idx = {}
        self.idx_to_state = {}
        
    def build_transition_matrix(self, journeys_df):
        """
        Build sparse transition matrix from journey paths.
        """
        print("Building transition matrix...")
        
        # Extract all unique channels
        all_channels = set()
        for path in journeys_df['journey_path']:
            channels = path.split(' > ')
            all_channels.update(channels)
        
        # Add special states
        states = ['START'] + sorted(list(all_channels)) + ['CONVERSION', 'NULL']
        self.channels = sorted(list(all_channels))
        
        # Create state mappings
        self.state_to_idx = {state: idx for idx, state in enumerate(states)}
        self.idx_to_state = {idx: state for state, idx in self.state_to_idx.items()}
        
        n_states = len(states)
        
        # Accumulate transition weights in a dictionary (memory efficient).
        # Each transition is weighted by the journey's conversion value,
        # so the matrix reflects revenue rather than raw transition counts.
        transition_counts = defaultdict(float)
        
        # zip over the two columns is much faster than iterrows()
        journey_iter = zip(journeys_df['journey_path'], journeys_df['conversion_value'])
        for journey_path, value in tqdm(journey_iter, total=len(journeys_df), desc="Counting transitions"):
            path = journey_path.split(' > ')
            
            # START to first channel
            transition_counts[(self.state_to_idx['START'], 
                             self.state_to_idx[path[0]])] += value
            
            # Channel to channel transitions
            for i in range(len(path) - 1):
                from_idx = self.state_to_idx[path[i]]
                to_idx = self.state_to_idx[path[i + 1]]
                transition_counts[(from_idx, to_idx)] += value
            
            # Last channel to CONVERSION
            transition_counts[(self.state_to_idx[path[-1]], 
                             self.state_to_idx['CONVERSION'])] += value
        
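        # Make CONVERSION and NULL absorbing (self-loop with probability 1)
        # so probability mass that reaches them is kept during iteration
        for absorbing_state in ('CONVERSION', 'NULL'):
            absorbing_idx = self.state_to_idx[absorbing_state]
            transition_counts[(absorbing_idx, absorbing_idx)] = 1.0
        
        # Build sparse matrix
        row_indices = []
        col_indices = []
        data = []
        
        for (from_idx, to_idx), count in transition_counts.items():
            row_indices.append(from_idx)
            col_indices.append(to_idx)
            data.append(count)
        
        # Create sparse matrix
        self.transition_matrix = csr_matrix(
            (data, (row_indices, col_indices)),
            shape=(n_states, n_states),
            dtype=float
        )
        
        # Normalize to get probabilities
        row_sums = np.array(self.transition_matrix.sum(axis=1)).flatten()
        row_sums[row_sums == 0] = 1  # Avoid division by zero
        
        # Normalize each row. multiply() with a dense column returns a COO
        # matrix, which supports neither indexing nor item assignment, so
        # convert back to CSR for the removal-effect step.
        self.transition_matrix = self.transition_matrix.multiply(
            1.0 / row_sums[:, np.newaxis]
        ).tocsr()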
        
        print(f"✓ Transition matrix built: {n_states} states, {len(data):,} transitions")
        print(f"✓ Matrix density: {self.transition_matrix.nnz / (n_states ** 2):.4%}")
        
    def calculate_conversion_probability(self):
        """
        Calculate base conversion probability from START to CONVERSION.
        """
        start_idx = self.state_to_idx['START']
        conversion_idx = self.state_to_idx['CONVERSION']
        
        null_idx = self.state_to_idx['NULL']
        n_states = self.transition_matrix.shape[0]
        
        # Propagate the START distribution through the chain and accumulate
        # the mass that arrives at CONVERSION. Mass that reaches CONVERSION
        # or NULL is taken out of the walk, so the result is the same
        # whether or not those states carry self-loops.
        step_matrix = self.transition_matrix.T.tocsr()
        prob_vector = np.zeros(n_states)
        prob_vector[start_idx] = 1.0
        conversion_prob = 0.0
        
        for _ in range(100):  # Max iterations
            prob_vector = step_matrix.dot(prob_vector)
            conversion_prob += prob_vector[conversion_idx]
            prob_vector[conversion_idx] = 0.0
            prob_vector[null_idx] = 0.0
            if prob_vector.sum() < 1e-12:  # All mass absorbed
                break
        
        return conversion_prob
    
    def calculate_removal_effects(self):
        """
        Calculate removal effect for each channel.
        This shows the impact of removing each channel from the model.
        """
        print("\nCalculating removal effects...")
        
        base_conversion_prob = self.calculate_conversion_probability()
        
        for channel in tqdm(self.channels, desc="Processing channels"):
            # Create modified matrix with channel removed. CSR is used for
            # reading entries, LIL for cheap row/column assignment.
            base_matrix = self.transition_matrix.tocsr()
            modified_matrix = base_matrix.tolil()
            
            channel_idx = self.state_to_idx[channel]
            null_idx = self.state_to_idx['NULL']
            start_idx = self.state_to_idx['START']
            conversion_idx = self.state_to_idx['CONVERSION']
            
            # Probability of entering the removed channel from each state
            incoming = base_matrix[:, channel_idx].toarray().ravel()
            
            # Zero out the channel's row and column
            modified_matrix[channel_idx, :] = 0
            modified_matrix[:, channel_idx] = 0
            
            # Redirect traffic that would have entered the channel to NULL.
            # Every other row still sums to 1, so no renormalization is needed.
            for i in np.nonzero(incoming)[0]:
                if i != channel_idx:
                    modified_matrix[i, null_idx] += incoming[i]
            
            # Calculate new conversion probability by accumulating the mass
            # that arrives at CONVERSION; absorbed mass leaves the walk
            step_matrix = modified_matrix.T.tocsr()
            prob_vector = np.zeros(modified_matrix.shape[0])
            prob_vector[start_idx] = 1.0
            new_conversion_prob = 0.0
            
            for _ in range(100):
                prob_vector = step_matrix.dot(prob_vector)
                new_conversion_prob += prob_vector[conversion_idx]
                prob_vector[conversion_idx] = 0.0
                prob_vector[null_idx] = 0.0
            
            # Removal effect
            removal_effect = 1 - (new_conversion_prob / base_conversion_prob)
            self.removal_effects[channel] = removal_effect
        
        print("✓ Removal effects calculated")
        
    def attribute_conversions(self, total_conversions, total_revenue):
        """
        Attribute conversions and revenue to channels based on removal effects.
        """
        # Normalize removal effects
        total_removal = sum(self.removal_effects.values())
        
        if total_removal == 0:
            print("Warning: Total removal effect is zero")
            return {}
        
        attribution = {}
        for channel, effect in self.removal_effects.items():
            attribution[channel] = {
                'removal_effect': effect,
                'attribution_weight': effect / total_removal,
                'attributed_conversions': (effect / total_removal) * total_conversions,
                'attributed_revenue': (effect / total_removal) * total_revenue
            }
        
        return attribution

# Test with sample data
print("\n" + "="*60)
print("MARKOV CHAIN ATTRIBUTION")
print("="*60)

markov_model = SparseMarkovAttribution()
markov_model.build_transition_matrix(journeys_df)
markov_model.calculate_removal_effects()

total_revenue = journeys_df['conversion_value'].sum()
total_conversions = len(journeys_df)

markov_attribution = markov_model.attribute_conversions(total_conversions, total_revenue)

# Display results
print("\nMarkov Attribution Results:")
results_df = pd.DataFrame(markov_attribution).T
results_df = results_df.sort_values('attributed_revenue', ascending=False)
print(results_df.to_string())
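
To make the removal effect concrete, here is a small worked example in plain Python, independent of the class above. The two-channel chain and its probabilities are invented for illustration, and the helper `conversion_probability` is hypothetical. Unlike the synthetic dataset, this toy chain includes non-converting paths, so the base conversion probability is below 1.

```python
def conversion_probability(transitions, removed=None, n_iter=200):
    """P(reach CONVERSION from START); a removed channel is rerouted to NULL."""
    state = {"START": 1.0}
    converted = 0.0
    for _ in range(n_iter):
        next_state = {}
        for src, mass in state.items():
            for dst, prob in transitions.get(src, {}).items():
                if dst == removed:
                    dst = "NULL"  # traffic into the removed channel is lost
                if dst == "CONVERSION":
                    converted += mass * prob
                elif dst != "NULL":
                    next_state[dst] = next_state.get(dst, 0.0) + mass * prob
        state = next_state
        if not state:  # all probability mass has been absorbed
            break
    return converted


# Toy chain (illustrative numbers only)
toy_transitions = {
    "START": {"A": 0.5, "B": 0.5},
    "A": {"CONVERSION": 0.4, "B": 0.2, "NULL": 0.4},
    "B": {"CONVERSION": 0.5, "NULL": 0.5},
}

base_prob = conversion_probability(toy_transitions)
removal_effects = {
    channel: 1 - conversion_probability(toy_transitions, removed=channel) / base_prob
    for channel in ("A", "B")
}
total_effect = sum(removal_effects.values())
attribution_weights = {c: e / total_effect for c, e in removal_effects.items()}

print(f"Base conversion probability: {base_prob:.3f}")
for channel in ("A", "B"):
    print(f"{channel}: removal effect {removal_effects[channel]:.3f}, "
          f"weight {attribution_weights[channel]:.3f}")
```

By hand: the base probability is 0.5 × (0.4 + 0.2 × 0.5) + 0.5 × 0.5 = 0.5. Removing B leaves only START → A → CONVERSION (0.5 × 0.4 = 0.2), a removal effect of 0.6; removing A leaves START → B → CONVERSION (0.25), a removal effect of 0.5.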

## 7. Shapley Value Approximation for Large Datasets

### Monte Carlo Sampling for Efficient Computation

In [None]:
class ApproximateShapleyAttribution:
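    """
    Approximate Shapley values using Monte Carlo sampling.
    Exact Shapley computation needs all 2^n coalitions (n = number of
    channels), which is infeasible for many channels. Sampling m random
    permutations needs O(n * m) coalition evaluations; each evaluation
    here is a full pass over the journeys.
    """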
    
    def __init__(self, n_samples=10000):
        self.n_samples = n_samples
        self.shapley_values = {}
        
    def calculate_coalition_value(self, coalition, journeys_df):
        """
        Calculate the value (conversions) for a coalition of channels.
        """
        if not coalition:
            return 0
        
        coalition_set = set(coalition)
        total_value = 0
        
        for _, row in journeys_df.iterrows():
            journey_channels = set(row['journey_path'].split(' > '))
            # Journey succeeds if it only uses channels in the coalition
            if journey_channels.issubset(coalition_set):
                total_value += row['conversion_value']
        
        return total_value
    
    def approximate_shapley(self, journeys_df, channels=None):
        """
        Approximate Shapley values using Monte Carlo sampling.
        """
        if channels is None:
            # Extract all unique channels
            all_channels = set()
            for path in journeys_df['journey_path']:
                all_channels.update(path.split(' > '))
            channels = sorted(list(all_channels))
        
        n_channels = len(channels)
        print(f"Approximating Shapley values for {n_channels} channels...")
        print(f"Using {self.n_samples:,} Monte Carlo samples")
        
        # Initialize marginal contribution sums
        marginal_contribs = {channel: [] for channel in channels}
        
        # Monte Carlo sampling
        for sample in tqdm(range(self.n_samples), desc="MC Sampling"):
            # Random permutation of channels
            permutation = np.random.permutation(channels)
            
            # For each channel in permutation, calculate marginal contribution.
            # The value after adding one channel is the value before adding
            # the next, so each coalition is evaluated only once.
            value_before = 0.0  # Empty coalition
            for i, channel in enumerate(permutation):
                # Coalition after adding this channel
                coalition_after = permutation[:i+1].tolist()
                value_after = self.calculate_coalition_value(coalition_after, journeys_df)
                
                marginal_contrib = value_after - value_before
                marginal_contribs[channel].append(marginal_contrib)
                value_before = value_after
        
        # Average marginal contributions = Shapley value
        for channel in channels:
            self.shapley_values[channel] = np.mean(marginal_contribs[channel])
        
        print("✓ Shapley values calculated")
        return self.shapley_values
    
    def get_attribution(self, total_conversions, total_revenue):
        """
        Convert Shapley values to attribution percentages.
        """
        total_shapley = sum(self.shapley_values.values())
        
        if total_shapley == 0:
            print("Warning: Total Shapley value is zero")
            return {}
        
        attribution = {}
        for channel, value in self.shapley_values.items():
            attribution[channel] = {
                'shapley_value': value,
                'attribution_weight': value / total_shapley,
                'attributed_conversions': (value / total_shapley) * total_conversions,
                'attributed_revenue': (value / total_shapley) * total_revenue
            }
        
        return attribution

# Test with smaller sample for demo (use more samples in production)
print("\n" + "="*60)
print("SHAPLEY VALUE ATTRIBUTION (APPROXIMATE)")
print("="*60)

# Use subset for faster computation
sample_journeys = journeys_df.sample(n=min(1000, len(journeys_df)), random_state=42)

shapley_model = ApproximateShapleyAttribution(n_samples=100)  # Reduce for demo
shapley_values = shapley_model.approximate_shapley(sample_journeys)

total_revenue_sample = sample_journeys['conversion_value'].sum()
total_conversions_sample = len(sample_journeys)

shapley_attribution = shapley_model.get_attribution(total_conversions_sample, total_revenue_sample)

print("\nShapley Attribution Results:")
shapley_df = pd.DataFrame(shapley_attribution).T
shapley_df = shapley_df.sort_values('attributed_revenue', ascending=False)
print(shapley_df.to_string())
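
For a handful of channels the Shapley values can be computed exactly by enumerating every ordering, which gives a reference point for the Monte Carlo estimate. The sketch below uses the same coalition rule as `calculate_coalition_value` (a journey counts only if all of its channels are in the coalition); the three toy journeys and the function names are made up for illustration.

```python
from itertools import permutations


def coalition_value(coalition, journeys):
    """Total value of journeys whose channels all belong to the coalition."""
    members = set(coalition)
    return sum(value for channels, value in journeys if channels <= members)


def exact_shapley(channels, journeys):
    """Average marginal contribution of each channel over all orderings."""
    totals = {channel: 0.0 for channel in channels}
    orderings = list(permutations(channels))
    for order in orderings:
        value_before = 0.0  # empty coalition
        for i, channel in enumerate(order):
            value_after = coalition_value(order[: i + 1], journeys)
            totals[channel] += value_after - value_before
            value_before = value_after
    return {channel: total / len(orderings) for channel, total in totals.items()}


# (set of channels in the journey, conversion value): illustrative data only
toy_journeys = [
    ({"A"}, 10.0),
    ({"A", "B"}, 20.0),
    ({"B", "C"}, 30.0),
]
toy_shapley = exact_shapley(["A", "B", "C"], toy_journeys)
print(toy_shapley)
print("Sum of Shapley values:", sum(toy_shapley.values()))
```

The values sum to the total revenue of 60, as the efficiency property requires. They also match an equal split of each journey's value among its distinct channels (A: 10 + 10, B: 10 + 15, C: 15), which is what this particular coalition rule implies.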

## 8. Distributed Attribution with Dask

### Parallel Processing for 100M+ Touchpoints

In [None]:
class DistributedAttribution:
    """
    Distributed attribution computation using Dask.
    Processes data in parallel across multiple workers.
    """
    
    def __init__(self, n_workers=4):
        self.client = Client(n_workers=n_workers, threads_per_worker=2)
        print(f"✓ Dask client started: {self.client.dashboard_link}")
        
    def parallel_journey_processing(self, journeys_df):
        """
        Process journeys in parallel using Dask.
        """
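        # Convert to Dask DataFrame with one partition per available core.
        # Client.ncores() returns a {worker_address: n_cores} dict, not an int.
        n_partitions = max(1, sum(self.client.ncores().values()))
        ddf = dd.from_pandas(journeys_df, npartitions=n_partitions)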
        
        # Define processing function
        def process_journey(row):
            channels = row['journey_path'].split(' > ')
            n_touchpoints = len(channels)
            value = row['conversion_value']
            
            # Calculate linear attribution
            linear_credit = value / n_touchpoints
            
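            # Calculate time decay attribution, using position as a proxy for
            # time: the weight halves with each step back from the last touchpoint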
            decay_factor = 0.5
            weights = [decay_factor ** (n_touchpoints - i - 1) for i in range(n_touchpoints)]
            total_weight = sum(weights)
            
            results = []
            for i, channel in enumerate(channels):
                results.append({
                    'user_id': row['user_id'],
                    'channel': channel,
                    'position': i,
                    'linear_credit': linear_credit,
                    'time_decay_credit': value * weights[i] / total_weight,
                    'first_touch_credit': value if i == 0 else 0,
                    'last_touch_credit': value if i == n_touchpoints - 1 else 0,
                })
            
            return results
        
        # Apply processing in parallel
        print("Processing journeys in parallel...")
        
        # Use map_partitions for parallel processing
        def process_partition(df):
            all_results = []
            for _, row in df.iterrows():
                all_results.extend(process_journey(row))
            return pd.DataFrame(all_results)
        
        results_ddf = ddf.map_partitions(process_partition, meta={
            'user_id': 'str',
            'channel': 'str',
            'position': 'int',
            'linear_credit': 'float',
            'time_decay_credit': 'float',
            'first_touch_credit': 'float',
            'last_touch_credit': 'float'
        })
        
        # Compute results
        results_df = results_ddf.compute()
        
        print(f"✓ Processed {len(results_df):,} touchpoint attributions")
        
        return results_df
    
    def aggregate_attribution(self, results_df):
        """
        Aggregate attribution results by channel.
        """
        aggregated = results_df.groupby('channel').agg({
            'linear_credit': 'sum',
            'time_decay_credit': 'sum',
            'first_touch_credit': 'sum',
            'last_touch_credit': 'sum'
        }).reset_index()
        
        aggregated.columns = ['channel', 'linear_revenue', 'time_decay_revenue', 
                             'first_touch_revenue', 'last_touch_revenue']
        
        return aggregated.sort_values('linear_revenue', ascending=False)
    
    def close(self):
        """Close Dask client"""
        self.client.close()
        print("✓ Dask client closed")

# Test distributed processing
print("\n" + "="*60)
print("DISTRIBUTED ATTRIBUTION COMPUTATION")
print("="*60)

dist_attr = DistributedAttribution(n_workers=2)
distributed_results = dist_attr.parallel_journey_processing(journeys_df)
aggregated_results = dist_attr.aggregate_attribution(distributed_results)

print("\nAggregated Attribution Results:")
print(aggregated_results.to_string(index=False))

dist_attr.close()
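
The decay weighting inside `process_journey` is easy to check by hand on a single journey. The following plain-Python sketch reproduces the same weight formula; the function name `position_decay_credits` and the example journey are hypothetical.

```python
def position_decay_credits(channels, value, decay_factor=0.5):
    """Split `value` across channels, weighting later touchpoints more."""
    n_touchpoints = len(channels)
    # The last touchpoint gets weight 1; each earlier step is multiplied by decay_factor
    weights = [decay_factor ** (n_touchpoints - i - 1) for i in range(n_touchpoints)]
    total_weight = sum(weights)
    return [
        (channel, value * weight / total_weight)
        for channel, weight in zip(channels, weights)
    ]


toy_credits = position_decay_credits(["Social", "Email", "Direct"], 70.0)
for channel, credit in toy_credits:
    print(f"{channel}: {credit:.2f}")
```

With three touchpoints and a decay factor of 0.5 the weights are 0.25, 0.5 and 1.0, so a 70.00 conversion splits into 10.00, 20.00 and 40.00.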

## 9. Graph-Based Journey Analysis

### Network Analysis for Customer Pathways

In [None]:
class JourneyGraphAnalyzer:
    """
    Analyze customer journeys using graph theory.
    Identify key paths, bottlenecks, and channel relationships.
    """
    
    def __init__(self):
        self.graph = nx.DiGraph()
        self.journey_paths = []
        
    def build_journey_graph(self, journeys_df, min_edge_weight=10):
        """
        Build directed graph from customer journeys.
        """
        print("Building journey graph...")
        
        edge_weights = defaultdict(float)
        
        for _, row in journeys_df.iterrows():
            channels = row['journey_path'].split(' > ')
            value = row['conversion_value']
            
            # Add edges with weights
            for i in range(len(channels) - 1):
                edge = (channels[i], channels[i + 1])
                edge_weights[edge] += value
        
        # Add edges to graph (filter by minimum weight)
        for (source, target), weight in edge_weights.items():
            if weight >= min_edge_weight:
                self.graph.add_edge(source, target, weight=weight)
        
        print(f"✓ Graph built: {self.graph.number_of_nodes()} nodes, {self.graph.number_of_edges()} edges")
    
    def calculate_centrality_metrics(self):
        """
        Calculate various centrality metrics for channels.
        """
        print("\nCalculating centrality metrics...")
        
        metrics = {}
        
        # PageRank (importance in the network)
        pagerank = nx.pagerank(self.graph, weight='weight')
        
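        # Betweenness centrality (bridge between other channels).
        # NetworkX treats the weight as a distance, so use inverse revenue:
        # high-value edges become short and attract shortest paths.
        for _, _, edge_data in self.graph.edges(data=True):
            edge_weight = edge_data['weight']
            edge_data['distance'] = 1.0 / edge_weight if edge_weight > 0 else float('inf')
        betweenness = nx.betweenness_centrality(self.graph, weight='distance')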
        
        # In-degree and out-degree
        in_degree = dict(self.graph.in_degree(weight='weight'))
        out_degree = dict(self.graph.out_degree(weight='weight'))
        
        # Combine metrics
        for node in self.graph.nodes():
            metrics[node] = {
                'pagerank': pagerank.get(node, 0),
                'betweenness': betweenness.get(node, 0),
                'in_degree': in_degree.get(node, 0),
                'out_degree': out_degree.get(node, 0),
                'total_degree': in_degree.get(node, 0) + out_degree.get(node, 0)
            }
        
        return pd.DataFrame(metrics).T.sort_values('pagerank', ascending=False)
    
    def find_critical_paths(self, top_n=10):
        """
        Identify most valuable paths through the journey graph.
        """
        print(f"\nFinding top {top_n} critical paths...")
        
        # Get all simple paths (up to length 5 to avoid explosion)
        all_paths = []
        
        # Find source nodes (high out-degree, low in-degree)
        source_candidates = [
            node for node in self.graph.nodes()
            if self.graph.in_degree(node) < self.graph.out_degree(node)
        ]
        
        # Find sink nodes (high in-degree, low out-degree)
        sink_candidates = [
            node for node in self.graph.nodes()
            if self.graph.out_degree(node) < self.graph.in_degree(node)
        ]
        
        for source in source_candidates[:5]:  # Limit sources
            for sink in sink_candidates[:5]:  # Limit sinks
                if source != sink:
                    try:
                        for path in nx.all_simple_paths(self.graph, source, sink, cutoff=5):
                            # Calculate path value
                            path_value = sum(
                                self.graph[path[i]][path[i+1]]['weight']
                                for i in range(len(path) - 1)
                            )
                            all_paths.append({
                                'path': ' > '.join(path),
                                'length': len(path),
                                'total_value': path_value
                            })
                    except nx.NetworkXNoPath:
                        continue
        
        if all_paths:
            paths_df = pd.DataFrame(all_paths)
            return paths_df.sort_values('total_value', ascending=False).head(top_n)
        else:
            return pd.DataFrame()
    
    def visualize_journey_graph(self, top_n_edges=50):
        """
        Visualize the journey graph.
        """
        # Get top N edges by weight
        edges_with_weights = [
            (u, v, d['weight']) 
            for u, v, d in self.graph.edges(data=True)
        ]
        edges_with_weights.sort(key=lambda x: x[2], reverse=True)
        top_edges = edges_with_weights[:top_n_edges]
        
        # Create subgraph with top edges
        subgraph = nx.DiGraph()
        for u, v, w in top_edges:
            subgraph.add_edge(u, v, weight=w)
        
        # Create visualization
        plt.figure(figsize=(16, 12))
        
        # Use spring layout
        pos = nx.spring_layout(subgraph, k=2, iterations=50)
        
        # Draw nodes
        node_sizes = [
            subgraph.degree(node, weight='weight') * 2
            for node in subgraph.nodes()
        ]
        
        nx.draw_networkx_nodes(
            subgraph, pos,
            node_size=node_sizes,
            node_color='lightblue',
            alpha=0.7
        )
        
        # Draw edges
        edge_weights = [subgraph[u][v]['weight'] for u, v in subgraph.edges()]
        # Fall back to 1 so an empty graph or all-zero weights cannot raise
        max_weight = max(edge_weights, default=0) or 1
        edge_widths = [5 * w / max_weight for w in edge_weights]
        
        nx.draw_networkx_edges(
            subgraph, pos,
            width=edge_widths,
            alpha=0.5,
            edge_color='gray',
            arrows=True,
            arrowsize=20
        )
        
        # Draw labels
        nx.draw_networkx_labels(
            subgraph, pos,
            font_size=10,
            font_weight='bold'
        )
        
        plt.title(f"Customer Journey Graph (Top {top_n_edges} Transitions)", 
                 fontsize=16, fontweight='bold')
        plt.axis('off')
        plt.tight_layout()
        plt.show()

# Test graph analysis
print("\n" + "="*60)
print("GRAPH-BASED JOURNEY ANALYSIS")
print("="*60)

graph_analyzer = JourneyGraphAnalyzer()
graph_analyzer.build_journey_graph(journeys_df, min_edge_weight=100)

centrality_metrics = graph_analyzer.calculate_centrality_metrics()
print("\nChannel Centrality Metrics:")
print(centrality_metrics.to_string())

critical_paths = graph_analyzer.find_critical_paths(top_n=10)
if not critical_paths.empty:
    print("\nTop Critical Paths:")
    print(critical_paths.to_string(index=False))

# Visualize
graph_analyzer.visualize_journey_graph(top_n_edges=20)
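
`find_critical_paths` collects every simple path in a list before sorting, which can grow quickly on dense graphs. A minimal sketch of an alternative, assuming the same `weight` edge attribute: score the paths lazily and keep only the best `top_n` with `heapq.nlargest`. The function name `top_paths_by_value` and the toy channels and values are hypothetical.

```python
import heapq
import networkx as nx

def top_paths_by_value(graph, sources, sinks, top_n=10, cutoff=5):
    """Return the top_n (value, path) pairs without storing every path."""
    def scored_paths():
        for source in sources:
            for sink in sinks:
                if source == sink:
                    continue
                for path in nx.all_simple_paths(graph, source, sink, cutoff=cutoff):
                    # Path value = sum of the edge weights along the path
                    value = sum(graph[u][v]["weight"] for u, v in zip(path, path[1:]))
                    yield value, path
    # nlargest consumes the generator while holding at most top_n items
    return heapq.nlargest(top_n, scored_paths(), key=lambda item: item[0])

# Toy journey graph (hypothetical channels and values)
toy_graph = nx.DiGraph()
toy_graph.add_edge("search", "email", weight=100.0)
toy_graph.add_edge("email", "direct", weight=100.0)
toy_graph.add_edge("search", "direct", weight=1.0)

best_paths = top_paths_by_value(toy_graph, ["search"], ["direct"], top_n=1)
```

Memory then depends on `top_n` rather than on the number of enumerated paths; the enumeration time itself is unchanged, so the `cutoff` limit still matters.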

## 10. Real-World Project: Multi-Touch Attribution System for 10M Customers

### Complete Enterprise Attribution Platform

In [None]:
class EnterpriseAttributionPlatform:
    """
    Complete attribution platform integrating all models.
    Production-ready with Redshift backend and caching.
    """
    
    def __init__(self, rs_connection):
        self.rs = rs_connection
        self.cache = {}
        self.results = {}
        
    def run_all_models(self, start_date, end_date, use_cache=True):
        """
        Run all attribution models and compare results.
        """
        cache_key = f"{start_date}_{end_date}"
        
        if use_cache and cache_key in self.cache:
            print("Using cached results")
            # Keep self.results in sync so save/report use the cached run
            self.results = self.cache[cache_key]
            return self.results
        
        print(f"\nRunning enterprise attribution for {start_date} to {end_date}")
        print("="*70)
        
        # Load data
        loader = AttributionDataLoader(self.rs)
        journeys_df = loader.load_journeys_optimized(start_date, end_date)
        
        total_conversions = len(journeys_df)
        total_revenue = journeys_df['conversion_value'].sum()
        
        print(f"\nDataset: {total_conversions:,} conversions, ${total_revenue:,.2f} revenue")
        
        # 1. Simple models (fast)
        print("\n1. Computing simple attribution models...")
        simple_results = self._compute_simple_models(journeys_df)
        
        # 2. Markov Chain (medium)
        print("\n2. Computing Markov Chain attribution...")
        markov_model = SparseMarkovAttribution()
        markov_model.build_transition_matrix(journeys_df)
        markov_model.calculate_removal_effects()
        markov_results = markov_model.attribute_conversions(total_conversions, total_revenue)
        
        # 3. Shapley values (slow - use sampling)
        print("\n3. Computing Shapley value attribution...")
        sample_size = min(10000, len(journeys_df))
        shapley_sample = journeys_df.sample(n=sample_size, random_state=42)
        shapley_model = ApproximateShapleyAttribution(n_samples=1000)
        shapley_model.approximate_shapley(shapley_sample)
        shapley_results = shapley_model.get_attribution(
            total_conversions, total_revenue
        )
        
        # Combine results
        combined_results = self._combine_model_results(
            simple_results, markov_results, shapley_results
        )
        
        # Cache results
        self.cache[cache_key] = combined_results
        self.results = combined_results
        
        return combined_results
    
    def _compute_simple_models(self, journeys_df):
        """
        Compute first-touch, last-touch, linear, and time-decay attribution.
        """
        results = defaultdict(lambda: defaultdict(float))
        
        for _, row in journeys_df.iterrows():
            channels = row['journey_path'].split(' > ')
            value = row['conversion_value']
            n_channels = len(channels)
            
            # First touch
            results[channels[0]]['first_touch'] += value
            
            # Last touch
            results[channels[-1]]['last_touch'] += value
            
            # Linear
            linear_credit = value / n_channels
            for channel in channels:
                results[channel]['linear'] += linear_credit
            
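            # Time decay (position-based): each step further from the
            # conversion halves the weight; the last touch has weight 1
            decay_factor = 0.5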
            weights = [decay_factor ** (n_channels - i - 1) for i in range(n_channels)]
            total_weight = sum(weights)
            for i, channel in enumerate(channels):
                results[channel]['time_decay'] += value * weights[i] / total_weight
        
        return dict(results)
    
    def _combine_model_results(self, simple, markov, shapley):
        """
        Combine results from all models into unified DataFrame.
        """
        all_channels = set()
        all_channels.update(simple.keys())
        all_channels.update(markov.keys())
        all_channels.update(shapley.keys())
        
        combined = []
        for channel in sorted(all_channels):
            row = {'channel': channel}
            
            # Simple models
            if channel in simple:
                row['first_touch_revenue'] = simple[channel].get('first_touch', 0)
                row['last_touch_revenue'] = simple[channel].get('last_touch', 0)
                row['linear_revenue'] = simple[channel].get('linear', 0)
                row['time_decay_revenue'] = simple[channel].get('time_decay', 0)
            
            # Markov
            if channel in markov:
                row['markov_revenue'] = markov[channel]['attributed_revenue']
                row['markov_removal_effect'] = markov[channel]['removal_effect']
            
            # Shapley
            if channel in shapley:
                row['shapley_revenue'] = shapley[channel]['attributed_revenue']
                row['shapley_value'] = shapley[channel]['shapley_value']
            
            combined.append(row)
        
        df = pd.DataFrame(combined).fillna(0)
        
        # Add ensemble model (unweighted mean of Markov, Shapley, and linear revenue)
        df['ensemble_revenue'] = df[['markov_revenue', 'shapley_revenue', 'linear_revenue']].mean(axis=1)
        
        return df.sort_values('ensemble_revenue', ascending=False)
    
    def save_to_redshift(self, table_name='attribution_results'):
        """
        Save attribution results to Redshift.
        """
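        # `not df` raises ValueError for a DataFrame, so test the length instead
        if self.results is None or len(self.results) == 0:
            print("No results to save. Run attribution first.")
            return
        
        # Add metadata to a copy so the cached results stay unchanged
        output = self.results.copy()
        output['computation_date'] = datetime.now()
        output['model_version'] = 'v1.0'
        
        # Save to Redshift
        output.to_sql(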
            table_name,
            self.rs.engine,
            if_exists='append',
            index=False,
            method='multi'
        )
        
        print(f"✓ Results saved to {table_name}")
    
    def generate_comparison_report(self):
        """
        Generate visual comparison of all attribution models.
        """
        # self.results starts as an empty dict, so check for emptiness as well
        if self.results is None or len(self.results) == 0:
            print("No results available. Run attribution first.")
            return
        
        df = self.results
        
        # Create comparison visualizations
        fig, axes = plt.subplots(2, 2, figsize=(18, 12))
        
        # 1. Model comparison by channel
        ax1 = axes[0, 0]
        model_cols = ['first_touch_revenue', 'last_touch_revenue', 'linear_revenue', 
                     'markov_revenue', 'shapley_revenue']
        df[['channel'] + model_cols].set_index('channel').plot(kind='bar', ax=ax1)
        ax1.set_title('Attribution by Model', fontsize=14, fontweight='bold')
        ax1.set_xlabel('Channel')
        ax1.set_ylabel('Attributed Revenue ($)')
        ax1.legend(title='Model', bbox_to_anchor=(1.05, 1))
        ax1.tick_params(axis='x', rotation=45)
        
        # 2. Model correlation heatmap
        ax2 = axes[0, 1]
        correlation = df[model_cols].corr()
        sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', ax=ax2)
        ax2.set_title('Model Correlation', fontsize=14, fontweight='bold')
        
        # 3. Distribution comparison
        ax3 = axes[1, 0]
        ensemble_data = df[['channel', 'ensemble_revenue']].sort_values('ensemble_revenue', ascending=False)
        ax3.pie(ensemble_data['ensemble_revenue'], labels=ensemble_data['channel'], autopct='%1.1f%%')
        ax3.set_title('Ensemble Model Distribution', fontsize=14, fontweight='bold')
        
        # 4. Markov removal effect
        ax4 = axes[1, 1]
        markov_data = df[['channel', 'markov_removal_effect']].sort_values('markov_removal_effect', ascending=False)
        ax4.barh(markov_data['channel'], markov_data['markov_removal_effect'])
        ax4.set_title('Markov Removal Effects', fontsize=14, fontweight='bold')
        ax4.set_xlabel('Removal Effect')
        
        plt.tight_layout()
        plt.show()
        
        # Print summary table
        print("\n" + "="*70)
        print("ATTRIBUTION MODEL COMPARISON")
        print("="*70)
        print(df.to_string(index=False))

print("✓ Enterprise Attribution Platform configured")
print("\nReady to process 10M+ customer journeys with:")
print("  - Multiple attribution models (First/Last touch, Linear, Markov, Shapley)")
print("  - Distributed processing with Dask")
print("  - Redshift integration for data storage")
print("  - Graph-based journey analysis")
print("  - Production-ready caching and optimization")
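
`_compute_simple_models` walks the journeys with `iterrows()`, which tends to be slow on large frames. A minimal sketch of the same four rules expressed with pandas `explode` and `groupby`, assuming the `journey_path` and `conversion_value` columns used above. The function name `simple_attribution_vectorized` and the demo data are hypothetical, and this is an illustration rather than a drop-in replacement for the class method.

```python
import pandas as pd

def simple_attribution_vectorized(journeys_df, decay_factor=0.5):
    """First-touch, last-touch, linear and time-decay revenue per channel."""
    touches = journeys_df[["journey_path", "conversion_value"]].copy()
    touches["journey_id"] = range(len(touches))
    touches["channel"] = touches["journey_path"].str.split(" > ")
    # One row per touchpoint, in journey order
    touches = touches.explode("channel", ignore_index=True)

    touches["position"] = touches.groupby("journey_id").cumcount()
    touches["n_channels"] = touches.groupby("journey_id")["channel"].transform("size")

    value = touches["conversion_value"]
    is_first = touches["position"] == 0
    is_last = touches["position"] == touches["n_channels"] - 1
    touches["first_touch"] = value.where(is_first, 0.0)
    touches["last_touch"] = value.where(is_last, 0.0)
    touches["linear"] = value / touches["n_channels"]

    # Same weighting as the loop: decay_factor ** (steps before the conversion)
    raw = decay_factor ** (touches["n_channels"] - touches["position"] - 1)
    weight_sum = raw.groupby(touches["journey_id"]).transform("sum")
    touches["time_decay"] = value * raw / weight_sum

    cols = ["first_touch", "last_touch", "linear", "time_decay"]
    return touches.groupby("channel")[cols].sum()

# Hypothetical demo data: two journeys worth 100 in total
demo_journeys = pd.DataFrame({
    "journey_path": ["search > email > direct", "email"],
    "conversion_value": [70.0, 30.0],
})
attribution = simple_attribution_vectorized(demo_journeys)
print(attribution)
```

Each rule only redistributes a journey's value across its touchpoints, so every column sums to the total revenue, which makes a cheap sanity check.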

## 11. Performance Monitoring and Optimization

In [None]:
import time
from functools import wraps
import psutil

def performance_monitor(func):
    """
    Decorator to monitor function performance.
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Memory before
        process = psutil.Process()
        mem_before = process.memory_info().rss / 1024 / 1024  # MB
        
        # Time execution (perf_counter is monotonic, unlike time.time)
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        
        # Memory after
        mem_after = process.memory_info().rss / 1024 / 1024  # MB
        
        print(f"\n[Performance] {func.__name__}:")
        print(f"  Execution time: {end_time - start_time:.2f} seconds")
        print(f"  Memory before: {mem_before:.2f} MB")
        print(f"  Memory after: {mem_after:.2f} MB")
        print(f"  Memory delta: {mem_after - mem_before:.2f} MB")
        
        return result
    return wrapper

# Example usage
@performance_monitor
def test_large_scale_processing():
    """Test processing performance"""
    test_df = generate_large_scale_journeys(n_journeys=10000, save_to_file=False)
    return len(test_df)

n_journeys = test_large_scale_processing()
print(f"\n✓ Processed {n_journeys:,} journeys")
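
The RSS difference reported above only compares memory before and after the call, so a large temporary allocation that is freed before the function returns does not show up. A minimal sketch of a complementary decorator that uses the standard-library `tracemalloc` module to report the peak traced during the call. The name `peak_memory_monitor` and the `last_stats` attribute are hypothetical, and `tracemalloc` only sees allocations made through Python's memory allocators.

```python
import time
import tracemalloc
from functools import wraps

def peak_memory_monitor(func):
    """Report wall time and peak traced memory for each call."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Assumption: tracing is normally off; if it is already on, the
        # reported peak covers everything since tracing started.
        already_tracing = tracemalloc.is_tracing()
        if not already_tracing:
            tracemalloc.start()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            _, peak = tracemalloc.get_traced_memory()
            if not already_tracing:
                tracemalloc.stop()
            peak_mb = peak / 1024 / 1024
            # Keep the numbers on the wrapper so callers can log them
            wrapper.last_stats = {"seconds": elapsed, "peak_mb": peak_mb}
            print(f"[Performance] {func.__name__}: {elapsed:.2f} s, peak {peak_mb:.2f} MB")
    return wrapper

@peak_memory_monitor
def allocate_list(n_items):
    """Hypothetical workload: allocate a list of n_items zeros."""
    return [0] * n_items

items = allocate_list(1_000_000)
```

Tracing adds overhead to every allocation, so a decorator like this is better suited to profiling runs than to production jobs.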

## 12. Summary and Best Practices

### Key Takeaways

1. **Data Architecture**
   - Use Redshift DISTKEY and SORTKEY for optimal performance
   - Pre-aggregate journeys when possible
   - Sort large tables on a date column (Redshift does not partition native tables; date partitions apply to Spectrum external tables)

2. **Algorithm Selection**
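   - Use sparse matrices for Markov chains (memory grows with the number of observed transitions, not the square of the number of states; the saving depends on sparsity)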
   - Approximate Shapley values with Monte Carlo sampling
   - Distribute computation with Dask for 100M+ rows

3. **Performance Optimization**
   - Load data in chunks
   - Use columnar formats (Parquet)
   - Implement caching for repeated queries
   - Profile and monitor memory usage

4. **Production Deployment**
   - Implement connection pooling
   - Add error handling and retries
   - Schedule batch processing during off-hours
   - Monitor query performance in Redshift
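
The memory saving from sparse transition matrices depends on how many state pairs actually occur. A minimal sketch that compares the dense and CSR footprints of a randomly filled matrix; the helper name `transition_matrix_memory` and the sizes are illustrative assumptions, not measurements from the attribution data.

```python
import numpy as np
from scipy import sparse

def transition_matrix_memory(n_states, n_nonzero, seed=0):
    """Return (dense_bytes, sparse_bytes) for an n_states x n_states matrix."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, n_states, size=n_nonzero)
    cols = rng.integers(0, n_states, size=n_nonzero)
    vals = rng.random(n_nonzero)
    # Duplicate (row, col) pairs are summed on conversion to CSR
    mat = sparse.coo_matrix((vals, (rows, cols)), shape=(n_states, n_states)).tocsr()
    # CSR stores the values, their column indices, and one pointer per row
    sparse_bytes = mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes
    dense_bytes = n_states * n_states * mat.dtype.itemsize
    return dense_bytes, sparse_bytes

dense_b, sparse_b = transition_matrix_memory(n_states=2_000, n_nonzero=20_000)
print(f"dense: {dense_b / 1e6:.1f} MB, CSR: {sparse_b / 1e6:.2f} MB, "
      f"ratio: {dense_b / sparse_b:.0f}x")
```

With few states and most transitions observed, the ratio shrinks toward 1, so it is worth measuring on the real transition counts before relying on a fixed figure.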

### Next Steps
- Implement real-time attribution streaming
- Build attribution API with FastAPI
- Create automated reporting dashboards
- Integrate with marketing automation platforms