# Week 1: Advanced Python at Scale

## Learning Objectives
- Work with large datasets (millions of rows) efficiently
- Master memory-efficient techniques (generators, chunking)
- Implement parallel processing with multiprocessing
- Connect to Amazon Redshift from Python
- Optimize Python code for production performance
- Apply production-ready code patterns
- Leverage advanced data structures
- Implement robust error handling and logging

## Prerequisites
```bash
pip install psycopg2-binary sqlalchemy pandas numpy boto3 memory_profiler psutil
pip install pyarrow fastparquet "dask[complete]"
```

## 1. Setup and Imports

In [None]:
# Standard library
import os
import sys
import logging
import time
from datetime import datetime, timedelta
from collections import defaultdict, Counter, deque
from itertools import islice, groupby, chain
from functools import lru_cache, partial
from typing import Iterator, Generator, List, Dict, Tuple
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import gc

# Data manipulation
import numpy as np
import pandas as pd

# Database connectivity
import psycopg2
from psycopg2 import pool
from sqlalchemy import create_engine
import boto3

# Performance monitoring
from memory_profiler import profile
import psutil

# Configuration
import warnings
warnings.filterwarnings('ignore')

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('advanced_python.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"CPU cores: {mp.cpu_count()}")

## 2. Redshift Connection Setup

### Production-Ready Connection Patterns

In [None]:
class RedshiftConnection:
    """Production-ready Redshift connection manager with pooling."""
    
    def __init__(self, config: Dict[str, str]):
        self.config = config
        self.connection_pool = None
        self.engine = None
        self.logger = logging.getLogger(self.__class__.__name__)
        
    def create_connection_pool(self, min_conn: int = 1, max_conn: int = 10):
        """Create a connection pool for efficient connection management."""
        try:
            self.connection_pool = psycopg2.pool.ThreadedConnectionPool(
                min_conn,
                max_conn,
                host=self.config['host'],
                port=self.config.get('port', 5439),
                database=self.config['database'],
                user=self.config['user'],
                password=self.config['password'],
                connect_timeout=10,
                options='-c statement_timeout=300000'  # 5 minute timeout
            )
            self.logger.info(f"Connection pool created: {min_conn}-{max_conn} connections")
        except Exception as e:
            self.logger.error(f"Failed to create connection pool: {e}")
            raise
    
    def create_sqlalchemy_engine(self):
        """Create SQLAlchemy engine for pandas integration."""
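        # Percent-encode credentials so characters such as '@' or '/' in the
        # password do not corrupt the connection URL
        from urllib.parse import quote_plus
        connection_string = (
            f"postgresql+psycopg2://{quote_plus(self.config['user'])}:{quote_plus(self.config['password'])}"
            f"@{self.config['host']}:{self.config.get('port', 5439)}/{self.config['database']}"
        )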
        self.engine = create_engine(
            connection_string,
            pool_size=10,
            max_overflow=20,
            pool_pre_ping=True,  # Verify connections before using
            pool_recycle=3600,   # Recycle connections after 1 hour
            echo=False
        )
        self.logger.info("SQLAlchemy engine created")
        return self.engine
    
    def get_connection(self):
        """Get a connection from the pool."""
        if not self.connection_pool:
            self.create_connection_pool()
        return self.connection_pool.getconn()
    
    def return_connection(self, conn):
        """Return a connection to the pool."""
        if self.connection_pool:
            self.connection_pool.putconn(conn)
    
    def close_all(self):
        """Close all connections."""
        if self.connection_pool:
            self.connection_pool.closeall()
            self.logger.info("All connections closed")
        if self.engine:
            self.engine.dispose()


# Example configuration (use environment variables in production)
REDSHIFT_CONFIG = {
    'host': os.getenv('REDSHIFT_HOST', 'your-cluster.region.redshift.amazonaws.com'),
    'port': int(os.getenv('REDSHIFT_PORT', 5439)),
    'database': os.getenv('REDSHIFT_DB', 'marketing'),
    'user': os.getenv('REDSHIFT_USER', 'analyst'),
    'password': os.getenv('REDSHIFT_PASSWORD', 'your-password')
}

# Initialize connection manager
# rs_conn = RedshiftConnection(REDSHIFT_CONFIG)
# engine = rs_conn.create_sqlalchemy_engine()

print("✓ Connection manager configured")
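
`get_connection` and `return_connection` have to be paired by hand, and an exception between the two calls would leak a pooled connection. The following is a minimal sketch of one way to pair them automatically with a context manager. `FakePool` and `pooled_connection` are hypothetical names introduced here; `FakePool` only imitates the `getconn`/`putconn` methods of a psycopg2 pool so the example runs without a database.

```python
from contextlib import contextmanager


class FakePool:
    """Hypothetical stand-in exposing getconn/putconn like a psycopg2 pool."""

    def __init__(self):
        self.checked_out = 0

    def getconn(self):
        self.checked_out += 1
        return object()

    def putconn(self, conn):
        self.checked_out -= 1


@contextmanager
def pooled_connection(pool):
    """Borrow a connection and hand it back even if the body raises."""
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)


demo_pool = FakePool()
try:
    with pooled_connection(demo_pool) as conn:
        raise RuntimeError("simulated query failure")
except RuntimeError:
    pass

# The connection went back to the pool despite the error
print(demo_pool.checked_out)
```

The same wrapper should work with the pool held by `RedshiftConnection`, since it relies only on `getconn` and `putconn`.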

## 3. Memory-Efficient Data Processing

### 3.1 Generators for Large Datasets

In [None]:
def chunked_file_reader(file_path: str, chunk_size: int = 100000) -> Generator:
    """Read large CSV files in chunks using generators."""
    logger.info(f"Reading file in chunks of {chunk_size} rows")
    
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        yield chunk


def chunked_redshift_reader(query: str, engine, chunk_size: int = 50000) -> Generator:
    """Read large Redshift query results in chunks."""
    logger.info(f"Executing query with chunk size {chunk_size}")
    
    for chunk in pd.read_sql(query, engine, chunksize=chunk_size):
        yield chunk


def process_chunk(chunk: pd.DataFrame) -> Dict:
    """Process a single chunk and return aggregated results."""
    return {
        'rows': len(chunk),
        'revenue': chunk.get('revenue', pd.Series([0])).sum(),
        'unique_users': chunk.get('user_id', pd.Series()).nunique()
    }


# Example: Process large file without loading into memory
def process_large_file_streaming(file_path: str) -> Dict:
    """Process large file using streaming approach."""
    results = {
        'total_rows': 0,
        'total_revenue': 0,
        'unique_users': set()
    }
    
    start_time = time.time()
    
    for chunk_num, chunk in enumerate(chunked_file_reader(file_path, chunk_size=100000), 1):
        results['total_rows'] += len(chunk)
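        # Fall back to an empty Series so .sum() and .unique() still work when a column is missing
        results['total_revenue'] += chunk.get('revenue', pd.Series(dtype='float64')).sum()
        results['unique_users'].update(chunk.get('user_id', pd.Series(dtype='object')).unique())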
        
        if chunk_num % 10 == 0:
            logger.info(f"Processed {chunk_num} chunks, {results['total_rows']:,} rows")
            gc.collect()  # Force garbage collection
    
    results['unique_users'] = len(results['unique_users'])
    results['processing_time'] = time.time() - start_time
    
    return results


print("✓ Streaming processing functions defined")
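
The chunked readers above rely on pandas to do the batching. For plain Python iterables (rows from a cursor, lines from a file), the same idea can be sketched with `itertools.islice`. `batched` is a hypothetical helper name used for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is held in memory at a time
batch_sizes = [len(batch) for batch in batched(range(10), 4)]
print(batch_sizes)  # [4, 4, 2]
```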

### 3.2 Memory Profiling

In [None]:
def get_memory_usage():
    """Get current memory usage."""
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return {
        'rss_mb': mem_info.rss / 1024 / 1024,
        'vms_mb': mem_info.vms / 1024 / 1024,
        'percent': process.memory_percent()
    }


def memory_efficient_dataframe(size: int = 1000000):
    """Create memory-efficient DataFrames using appropriate dtypes."""
    
    # Inefficient approach
    print("Creating inefficient DataFrame...")
    mem_before = get_memory_usage()['rss_mb']
    
    df_inefficient = pd.DataFrame({
        'user_id': np.random.randint(1, 100000, size),
        'campaign_id': np.random.randint(1, 1000, size),
        'revenue': np.random.uniform(0, 1000, size),
        'timestamp': pd.date_range('2024-01-01', periods=size, freq='1s')
    })
    
    mem_after_inefficient = get_memory_usage()['rss_mb']
    print(f"Inefficient DataFrame memory: {df_inefficient.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"Process memory increased by: {mem_after_inefficient - mem_before:.2f} MB\n")
    
    # Efficient approach
    print("Creating efficient DataFrame...")
    mem_before = get_memory_usage()['rss_mb']
    
    df_efficient = pd.DataFrame({
        'user_id': np.random.randint(1, 100000, size, dtype=np.int32),
        'campaign_id': np.random.randint(1, 1000, size, dtype=np.int16),
        'revenue': np.random.uniform(0, 1000, size).astype(np.float32),
        'timestamp': pd.date_range('2024-01-01', periods=size, freq='1s')
    })
    
    mem_after_efficient = get_memory_usage()['rss_mb']
    print(f"Efficient DataFrame memory: {df_efficient.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"Process memory increased by: {mem_after_efficient - mem_before:.2f} MB\n")
    
    print(f"Memory savings: {(df_inefficient.memory_usage(deep=True).sum() - df_efficient.memory_usage(deep=True).sum()) / df_inefficient.memory_usage(deep=True).sum() * 100:.1f}%")
    
    return df_efficient


# Test memory optimization
# df = memory_efficient_dataframe(1000000)
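
Downcasting is only safe when every value fits the narrower type. As a rough sketch, the range check can be made explicit with `np.iinfo` before choosing a dtype. `smallest_int_dtype` is a hypothetical helper, not a pandas or NumPy function.

```python
import numpy as np


def smallest_int_dtype(min_value: int, max_value: int):
    """Return the narrowest signed integer dtype that can hold the given range."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= min_value and max_value <= info.max:
            return dtype
    raise ValueError("range does not fit in a 64-bit signed integer")


# campaign_id values up to 999 fit in int16; user_id values up to 99,999 need int32
print(smallest_int_dtype(1, 999).__name__)     # int16
print(smallest_int_dtype(1, 99_999).__name__)  # int32
```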

## 4. Parallel Processing

### 4.1 Multiprocessing for CPU-Intensive Tasks

In [None]:
def process_partition(data_chunk: pd.DataFrame) -> pd.DataFrame:
    """Process a partition of data (CPU-intensive operations)."""
    # Simulate complex processing
    result = data_chunk.copy()
    
    # Rolling mean over each user's last 7 rows (rows, not calendar days;
    # assumes the rows are already in time order)
    result['revenue_ma_7d'] = result.groupby('user_id')['revenue'].transform(
        lambda x: x.rolling(7, min_periods=1).mean()
    )
    
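    # Calculate user segments; right=False makes the bins [0, 100), [100, 500), ...
    # so a revenue of exactly 0 is labelled 'low' instead of NaN
    result['user_segment'] = pd.cut(
        result['revenue'],
        bins=[0, 100, 500, 1000, float('inf')],
        labels=['low', 'medium', 'high', 'vip'],
        right=False
    )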
    
    return result


def parallel_process_dataframe(df: pd.DataFrame, n_workers: int = None) -> pd.DataFrame:
    """Process DataFrame in parallel using multiprocessing."""
    if n_workers is None:
        n_workers = max(1, mp.cpu_count() - 1)
    
    logger.info(f"Processing with {n_workers} workers")
    
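    # Split by user_id so every row for a user lands in the same partition;
    # splitting by row position would cut users across workers and change
    # the per-user rolling mean computed in process_partition
    partition_keys = df['user_id'] % n_workers
    chunks = [part for _, part in df.groupby(partition_keys)]
    
    start_time = time.time()
    
    # Process in parallel
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        results = list(executor.map(process_partition, chunks))
    
    # Combine results and restore the original row order
    final_result = pd.concat(results).sort_index()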
    
    elapsed = time.time() - start_time
    logger.info(f"Parallel processing completed in {elapsed:.2f}s")
    
    return final_result


# Example: Compare serial vs parallel processing
def benchmark_parallel_processing(size: int = 1000000):
    """Benchmark serial vs parallel processing."""
    # Generate test data
    df = pd.DataFrame({
        'user_id': np.random.randint(1, 10000, size),
        'revenue': np.random.uniform(0, 1000, size),
        'timestamp': pd.date_range('2024-01-01', periods=size, freq='1min')
    })
    
    print(f"Dataset: {len(df):,} rows\n")
    
    # Serial processing
    print("Serial processing...")
    start = time.time()
    result_serial = process_partition(df)
    serial_time = time.time() - start
    print(f"Time: {serial_time:.2f}s\n")
    
    # Parallel processing
    print("Parallel processing...")
    start = time.time()
    result_parallel = parallel_process_dataframe(df, n_workers=4)
    parallel_time = time.time() - start
    print(f"Time: {parallel_time:.2f}s\n")
    
    speedup = serial_time / parallel_time
    print(f"Speedup: {speedup:.2f}x")
    
    return result_parallel


# Run benchmark
# result = benchmark_parallel_processing(100000)

### 4.2 Thread Pool for I/O-Bound Tasks

In [None]:
def fetch_campaign_data(campaign_id: int, engine) -> pd.DataFrame:
    """Fetch data for a single campaign (I/O-bound)."""
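    # Bind campaign_id as a parameter instead of formatting it into the SQL
    # string, which avoids SQL injection and quoting bugs
    from sqlalchemy import text
    query = text("""
    SELECT *
    FROM marketing_events
    WHERE campaign_id = :campaign_id
    LIMIT 10000
    """)
    return pd.read_sql(query, engine, params={'campaign_id': campaign_id})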


def parallel_fetch_campaigns(campaign_ids: List[int], engine, max_workers: int = 10) -> Dict[int, pd.DataFrame]:
    """Fetch multiple campaigns in parallel using thread pool."""
    results = {}
    
    start_time = time.time()
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_campaign = {
            executor.submit(fetch_campaign_data, cid, engine): cid 
            for cid in campaign_ids
        }
        
        # Collect results
        for future in future_to_campaign:
            campaign_id = future_to_campaign[future]
            try:
                results[campaign_id] = future.result(timeout=30)
            except Exception as e:
                logger.error(f"Campaign {campaign_id} failed: {e}")
                results[campaign_id] = pd.DataFrame()
    
    elapsed = time.time() - start_time
    logger.info(f"Fetched {len(campaign_ids)} campaigns in {elapsed:.2f}s")
    
    return results


print("✓ Parallel I/O functions defined")
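
`parallel_fetch_campaigns` collects futures in submission order. A common variant, sketched here with a simulated fetch so it runs without a database, uses `concurrent.futures.as_completed` to handle each result as soon as it is ready. `simulated_fetch` is a hypothetical stand-in for `fetch_campaign_data`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulated_fetch(campaign_id: int) -> dict:
    """Hypothetical stand-in for a database call; sleeps instead of querying."""
    time.sleep(0.05)
    return {"campaign_id": campaign_id, "rows": campaign_id * 10}


campaign_ids = [1, 2, 3, 4, 5, 6, 7, 8]
results = {}

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(simulated_fetch, cid): cid for cid in campaign_ids}
    # as_completed yields each future when it finishes, regardless of submit order
    for future in as_completed(futures):
        results[futures[future]] = future.result()
elapsed = time.perf_counter() - start

print(f"Fetched {len(results)} campaigns in {elapsed:.2f}s")
```

Because the threads spend their time waiting rather than computing, the total time should be close to one fetch rather than eight.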

## 5. Advanced Data Structures

### 5.1 Collections for Performance

In [None]:
from collections import defaultdict, Counter, deque, namedtuple
from itertools import groupby, islice

# Named tuples for clarity and memory efficiency
Event = namedtuple('Event', ['user_id', 'event_type', 'timestamp', 'value'])


def analyze_user_journey(events: List[Event]) -> Dict:
    """Analyze user journey using advanced data structures."""
    
    # defaultdict for automatic initialization
    user_paths = defaultdict(list)
    event_counts = Counter()
    
    # Group events by user
    events_sorted = sorted(events, key=lambda x: (x.user_id, x.timestamp))
    
    for user_id, user_events in groupby(events_sorted, key=lambda x: x.user_id):
        user_events_list = list(user_events)
        
        # Store journey path
        user_paths[user_id] = [e.event_type for e in user_events_list]
        
        # Count event types
        event_counts.update([e.event_type for e in user_events_list])
    
    return {
        'total_users': len(user_paths),
        'avg_journey_length': np.mean([len(p) for p in user_paths.values()]),
        'most_common_events': event_counts.most_common(5),
        'sample_journey': list(islice(user_paths.items(), 3))
    }


# Example: Generate sample events
def generate_sample_events(n: int = 100000) -> List[Event]:
    """Generate sample events for testing."""
    event_types = ['page_view', 'click', 'conversion', 'cart_add', 'purchase']
    
    events = [
        Event(
            user_id=np.random.randint(1, 10000),
            event_type=np.random.choice(event_types),
            timestamp=datetime.now() - timedelta(seconds=np.random.randint(0, 86400)),
            value=np.random.uniform(0, 100)
        )
        for _ in range(n)
    ]
    
    return events


# Test
events = generate_sample_events(10000)
analysis = analyze_user_journey(events)
print(f"Analyzed {analysis['total_users']} users")
print(f"Average journey length: {analysis['avg_journey_length']:.1f}")
print(f"Most common events: {analysis['most_common_events']}")
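
`deque` is imported in this cell but not used in the journey analysis. One typical use, sketched below, is a fixed-size sliding window: with `maxlen` set, appending to a full deque discards the oldest item automatically. `moving_average` is a hypothetical helper for illustration.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Average of the most recent `window` values at each step."""
    recent = deque(maxlen=window)  # oldest value drops out once the deque is full
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages


print(moving_average([10, 20, 30, 40], window=2))  # [10.0, 15.0, 25.0, 35.0]
```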

### 5.2 LRU Cache for Expensive Operations

In [None]:
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user_segment(user_id: int, revenue: float) -> str:
    """Expensive user segmentation logic (cached)."""
    # Simulate expensive computation
    time.sleep(0.001)
    
    if revenue < 100:
        return 'low'
    elif revenue < 500:
        return 'medium'
    elif revenue < 1000:
        return 'high'
    else:
        return 'vip'


def benchmark_caching(n_calls: int = 10000):
    """Benchmark caching performance."""
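    # lru_cache keys on the full argument tuple, so both arguments must repeat
    # for a hit; random floats almost never do, hence a small set of revenues
    user_ids = np.random.randint(1, 100, n_calls)
    revenues = np.random.choice([50.0, 250.0, 750.0, 1500.0], n_calls)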
    
    # With cache
    get_user_segment.cache_clear()
    start = time.time()
    results = [get_user_segment(uid, rev) for uid, rev in zip(user_ids, revenues)]
    cached_time = time.time() - start
    
    cache_info = get_user_segment.cache_info()
    print(f"Cached execution time: {cached_time:.3f}s")
    print(f"Cache hits: {cache_info.hits}, misses: {cache_info.misses}")
    print(f"Hit rate: {cache_info.hits / (cache_info.hits + cache_info.misses) * 100:.1f}%")


benchmark_caching(10000)
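
`lru_cache` only returns a cached result when the exact same arguments recur, so continuous inputs such as raw revenue amounts produce few hits. One possible workaround, sketched under the assumption that segment boundaries fall on multiples of 100, is to cache on a coarse bucket instead of the raw value. `segment_for_bucket` is a hypothetical helper.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def segment_for_bucket(bucket: int) -> str:
    """Map a 100-unit revenue bucket to a segment name."""
    if bucket < 1:
        return 'low'
    if bucket < 5:
        return 'medium'
    if bucket < 10:
        return 'high'
    return 'vip'


revenues = [12.5, 99.9, 150.0, 480.0, 730.0, 2500.0, 45.0]
# Integer division collapses many distinct revenues onto a few cache keys
segments = [segment_for_bucket(int(r // 100)) for r in revenues]
info = segment_for_bucket.cache_info()

print(segments)
print(f"hits={info.hits}, misses={info.misses}")  # hits=2, misses=5
```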

## 6. Production-Ready Error Handling

In [None]:
from typing import Optional
import traceback
from contextlib import contextmanager


class DataProcessingError(Exception):
    """Custom exception for data processing errors."""
    pass


@contextmanager
def error_handler(operation_name: str, raise_on_error: bool = False):
    """Context manager for consistent error handling."""
    try:
        logger.info(f"Starting: {operation_name}")
        yield
        logger.info(f"Completed: {operation_name}")
    except Exception as e:
        logger.error(f"Error in {operation_name}: {str(e)}")
        logger.error(traceback.format_exc())
        if raise_on_error:
            raise


def retry_with_backoff(func, max_retries: int = 3, backoff_factor: int = 2):
    """Retry function with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                logger.error(f"Failed after {max_retries} attempts")
                raise
            
            wait_time = backoff_factor ** attempt
            logger.warning(f"Attempt {attempt + 1} failed, retrying in {wait_time}s: {e}")
            time.sleep(wait_time)


def safe_divide(a: float, b: float, default: float = 0.0) -> float:
    """Safely divide two numbers."""
    try:
        return a / b if b != 0 else default
    except (TypeError, ZeroDivisionError) as e:
        logger.warning(f"Division error: {e}, returning {default}")
        return default


# Example usage
with error_handler("Data processing"):
    # Your code here
    result = safe_divide(100, 0)
    print(f"Result: {result}")
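
`retry_with_backoff` is defined above but not exercised. A minimal sketch of how it behaves with an operation that fails twice before succeeding is shown below. The copy here adds a hypothetical `base_delay` parameter so the demo sleeps for milliseconds rather than seconds, and `flaky_query` is an invented stand-in for a real call.

```python
import time


def retry_with_backoff(func, max_retries=3, backoff_factor=2, base_delay=0.01):
    """Call func, sleeping base_delay * backoff_factor**attempt after each failure."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * backoff_factor ** attempt)


attempts = {"count": 0}


def flaky_query():
    """Hypothetical operation that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome = retry_with_backoff(flaky_query)
print(outcome)            # ok
print(attempts["count"])  # 3
```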

## 7. Performance Optimization Techniques

In [None]:
import timeit


def benchmark_operations():
    """Benchmark different approaches to common operations."""
    
    print("Benchmarking list operations...\n")
    
    # List comprehension vs map
    time_listcomp = timeit.timeit(
        lambda: [x * 2 for x in range(100000)],
        number=10
    )
    
    time_map = timeit.timeit(
        lambda: list(map(lambda x: x * 2, range(100000))),
        number=10
    )
    
    print(f"List comprehension: {time_listcomp:.4f}s")
    print(f"Map function: {time_map:.4f}s")
    print(f"Winner: {'List comprehension' if time_listcomp < time_map else 'Map'}\n")
    
    # NumPy vectorization vs list comprehension
    arr = np.arange(1000000)
    
    time_numpy = timeit.timeit(
        lambda: arr * 2,
        number=10
    )
    
    time_list = timeit.timeit(
        lambda: [x * 2 for x in range(1000000)],
        number=10
    )
    
    print(f"NumPy vectorization: {time_numpy:.4f}s")
    print(f"List comprehension: {time_list:.4f}s")
    print(f"NumPy speedup: {time_list / time_numpy:.1f}x")


benchmark_operations()

## 8. Real-World Project: Process 10M Row Dataset

### Project: Marketing Campaign Performance Analysis

In [None]:
class MarketingDataProcessor:
    """Production-ready marketing data processor."""
    
    def __init__(self, chunk_size: int = 100000, n_workers: int = None):
        self.chunk_size = chunk_size
        self.n_workers = n_workers or max(1, mp.cpu_count() - 1)
        self.logger = logging.getLogger(self.__class__.__name__)
        self.metrics = defaultdict(float)
    
    def generate_sample_data(self, output_file: str, n_rows: int = 10000000):
        """Generate sample marketing data."""
        self.logger.info(f"Generating {n_rows:,} rows of sample data")
        
        for i, start in enumerate(range(0, n_rows, self.chunk_size)):
            # The final chunk is shorter when n_rows is not a multiple of chunk_size
            size = min(self.chunk_size, n_rows - start)
            chunk = pd.DataFrame({
                'event_id': range(start, start + size),
                'user_id': np.random.randint(1, 1000000, size),
                'campaign_id': np.random.randint(1, 10000, size),
                'channel': np.random.choice(['email', 'social', 'search', 'display'], size),
                'event_type': np.random.choice(['impression', 'click', 'conversion'], size, p=[0.8, 0.15, 0.05]),
                'revenue': np.random.exponential(50, size),
                # Offset each chunk so timestamps continue instead of restarting
                'timestamp': pd.date_range('2024-01-01', periods=size, freq='1s') + pd.Timedelta(seconds=start)
            })
            
            mode = 'w' if i == 0 else 'a'
            header = i == 0
            chunk.to_csv(output_file, mode=mode, header=header, index=False)
            
            if (i + 1) % 10 == 0:
                self.logger.info(f"Written {start + size:,} rows")
        
        self.logger.info(f"Data generation complete: {output_file}")
    
    def process_chunk_metrics(self, chunk: pd.DataFrame) -> Dict:
        """Calculate metrics for a chunk."""
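        # Campaign metrics
        # 'event_id' counts every event type, so the column is named 'events'.
        # Per-chunk 'nunique' values are summed when chunks are combined, which
        # makes 'unique_users' an upper bound: a user seen in two chunks counts twice.
        campaign_stats = chunk.groupby('campaign_id').agg({
            'event_id': 'count',
            'revenue': 'sum',
            'user_id': 'nunique'
        }).rename(columns={
            'event_id': 'events',
            'user_id': 'unique_users'
        })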
        
        # Channel performance
        channel_stats = chunk.groupby('channel').agg({
            'revenue': 'sum',
            'event_id': 'count'
        })
        
        # Conversion metrics
        conversions = chunk[chunk['event_type'] == 'conversion']
        
        return {
            'campaign_stats': campaign_stats,
            'channel_stats': channel_stats,
            'total_revenue': chunk['revenue'].sum(),
            'total_events': len(chunk),
            'conversions': len(conversions)
        }
    
    def process_file_streaming(self, input_file: str) -> Dict:
        """Process large file using streaming."""
        self.logger.info(f"Processing file: {input_file}")
        start_time = time.time()
        
        # Accumulated results
        campaign_stats_list = []
        channel_stats_list = []
        total_revenue = 0
        total_events = 0
        total_conversions = 0
        
        # Process in chunks
        for chunk_num, chunk in enumerate(pd.read_csv(input_file, chunksize=self.chunk_size), 1):
            metrics = self.process_chunk_metrics(chunk)
            
            campaign_stats_list.append(metrics['campaign_stats'])
            channel_stats_list.append(metrics['channel_stats'])
            total_revenue += metrics['total_revenue']
            total_events += metrics['total_events']
            total_conversions += metrics['conversions']
            
            if chunk_num % 10 == 0:
                self.logger.info(f"Processed {total_events:,} events")
                gc.collect()
        
        # Combine results
        final_campaign_stats = pd.concat(campaign_stats_list).groupby(level=0).sum()
        final_channel_stats = pd.concat(channel_stats_list).groupby(level=0).sum()
        
        processing_time = time.time() - start_time
        
        return {
            'campaign_stats': final_campaign_stats,
            'channel_stats': final_channel_stats,
            'total_revenue': total_revenue,
            'total_events': total_events,
            'total_conversions': total_conversions,
            'conversion_rate': total_conversions / total_events * 100,
            'processing_time': processing_time,
            'throughput': total_events / processing_time
        }


# Example usage
processor = MarketingDataProcessor(chunk_size=100000)

# Generate sample data (smaller for demo)
# processor.generate_sample_data('marketing_data_10m.csv', n_rows=1000000)

# Process the data
# results = processor.process_file_streaming('marketing_data_10m.csv')
# print(f"\nProcessing Summary:")
# print(f"Total events: {results['total_events']:,}")
# print(f"Total revenue: ${results['total_revenue']:,.2f}")
# print(f"Conversion rate: {results['conversion_rate']:.2f}%")
# print(f"Processing time: {results['processing_time']:.2f}s")
# print(f"Throughput: {results['throughput']:,.0f} events/second")

print("✓ Marketing data processor defined")

## 9. Best Practices Summary

### Memory Management
1. Use the narrowest dtype that holds the data (e.g. `int32` instead of `int64` when the values fit)
2. Process data in chunks for large datasets
3. Use generators instead of lists when possible
4. Call `gc.collect()` after processing large chunks
5. Use `del` to drop references to large objects so their memory can be reclaimed
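
As a small illustration of the generator point above, this sketch compares the container size of a list with that of an equivalent generator expression. Note that `sys.getsizeof` reports the container only, not the integer objects a list references, so the real gap is larger than shown.

```python
import sys

n = 100_000
squares_list = [i * i for i in range(n)]  # materializes every element
squares_gen = (i * i for i in range(n))   # produces elements on demand

list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(f"list: {list_bytes:,} bytes, generator: {gen_bytes:,} bytes")

# Both give the same aggregate; the generator can only be consumed once
total = sum(squares_gen)
```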

### Performance
1. Use NumPy vectorization over loops
2. Use multiprocessing for CPU-bound tasks
3. Use threading for I/O-bound tasks
4. Cache expensive function results with `@lru_cache`
5. Profile code to identify bottlenecks
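
For the profiling item, the standard library's `cProfile` and `pstats` are one option alongside the tools listed under Resources. A minimal sketch follows, with `slow_sum` as a hypothetical function standing in for real workload code.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately loop-heavy function to give the profiler something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200_000)
profiler.disable()

# Write the five most expensive entries (by cumulative time) into a string buffer
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```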

### Production Patterns
1. Implement connection pooling for databases
2. Add comprehensive error handling
3. Use structured logging
4. Implement retry logic with backoff
5. Monitor memory and CPU usage
6. Use context managers for resource cleanup
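
The logging setup earlier in this notebook writes plain-text lines. One way to approach the structured logging item, sketched here with the standard library only, is a formatter that emits each record as a JSON object. `JsonFormatter` and the `structured_demo` logger name are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

demo_logger = logging.getLogger("structured_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo output out of the root logger

demo_logger.info("processed %d rows", 1000)
print(stream.getvalue().strip())
```

Log lines in this shape can be parsed field by field by a log aggregator instead of being matched with regular expressions.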

### Redshift Optimization
1. Use appropriate distribution and sort keys
2. Fetch data in chunks with `chunksize`
3. Use connection pooling
4. Set appropriate timeouts
5. Use UNLOAD for large exports
6. Minimize data transfer with projections
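
For the UNLOAD item, the statement takes the query as a quoted string, so single quotes inside the query have to be doubled. The sketch below only builds the SQL text and does not execute it. `build_unload_sql`, the bucket name, and the role ARN are hypothetical placeholders, and the inputs are assumed to come from trusted code rather than user input.

```python
def build_unload_sql(select_sql: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build an UNLOAD statement that writes query results to S3 as Parquet.

    Assumes all three arguments are trusted values, not user input.
    """
    escaped = select_sql.replace("'", "''")  # double quotes inside the quoted query
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET"
    )


unload_sql = build_unload_sql(
    "SELECT campaign_id, revenue FROM marketing_events WHERE channel = 'email'",
    "s3://example-bucket/exports/events_",                # hypothetical bucket
    "arn:aws:iam::123456789012:role/ExampleUnloadRole",   # hypothetical role
)
print(unload_sql)
```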

## 10. Exercises

### Exercise 1: Memory-Efficient Processing
Download a large CSV from Kaggle (e.g., "Google Analytics Customer Revenue Prediction") and:
1. Calculate total revenue without loading the entire file into memory
2. Find the top 10 customers by revenue using a streaming approach
3. Compare memory usage with a full-load approach
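
As a possible starting point for this exercise, the sketch below streams rows one at a time with the `csv` module and keeps only per-customer totals in memory. The in-memory sample and its `customer_id` and `revenue` column names are assumptions; the real Kaggle file will have different columns.

```python
import csv
import heapq
import io
from collections import defaultdict

# Hypothetical sample standing in for a large file on disk
sample_csv = io.StringIO(
    "customer_id,revenue\n"
    "a,10.0\n"
    "b,5.5\n"
    "a,2.5\n"
    "c,20.0\n"
    "b,1.0\n"
)

revenue_by_customer = defaultdict(float)
total_revenue = 0.0
for row in csv.DictReader(sample_csv):  # yields one row at a time
    amount = float(row["revenue"])
    total_revenue += amount
    revenue_by_customer[row["customer_id"]] += amount

# nlargest avoids sorting every customer when only the top few are needed
top_customers = heapq.nlargest(2, revenue_by_customer.items(), key=lambda kv: kv[1])
print(total_revenue)  # 39.0
print(top_customers)  # [('c', 20.0), ('a', 12.5)]
```

Memory here grows with the number of distinct customers rather than the number of rows.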

### Exercise 2: Parallel Processing
1. Create a CPU-intensive transformation function
2. Benchmark serial vs parallel execution
3. Find optimal number of workers
4. Measure speedup and efficiency
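
Speedup is the serial time divided by the parallel time, and efficiency is speedup divided by the number of workers. A small sketch of the arithmetic, with made-up timings and a hypothetical helper name:

```python
from typing import Tuple


def speedup_and_efficiency(serial_seconds: float, parallel_seconds: float, n_workers: int) -> Tuple[float, float]:
    """Return (speedup, efficiency) for one benchmark run."""
    speedup = serial_seconds / parallel_seconds
    efficiency = speedup / n_workers  # 1.0 would mean perfect scaling
    return speedup, efficiency


# Made-up timings: 12s serial, 4s on 4 workers
speedup, efficiency = speedup_and_efficiency(12.0, 4.0, 4)
print(f"{speedup:.2f}x speedup, {efficiency:.0%} efficiency")  # 3.00x speedup, 75% efficiency
```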

### Exercise 3: Production Code
Refactor a data processing script to include:
1. Proper error handling
2. Logging at appropriate levels
3. Performance metrics
4. Configuration management
5. Unit tests
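
For the unit test item, one lightweight option is the standard library's `unittest`. The sketch below tests a simplified copy of `safe_divide` (without logging) and runs the suite programmatically, which also works inside a notebook cell.

```python
import unittest


def safe_divide(a: float, b: float, default: float = 0.0) -> float:
    """Simplified copy of the notebook's safe_divide, without logging."""
    return a / b if b != 0 else default


class SafeDivideTests(unittest.TestCase):
    def test_regular_division(self):
        self.assertEqual(safe_divide(10, 4), 2.5)

    def test_zero_denominator_returns_default(self):
        self.assertEqual(safe_divide(1, 0, default=-1.0), -1.0)


# Build and run the suite directly instead of calling unittest.main(),
# which would try to parse command-line arguments and exit the process
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SafeDivideTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(outcome.wasSuccessful())
```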

### Exercise 4: Redshift Integration
1. Create a connection pool manager
2. Implement chunked data loading from Redshift
3. Add retry logic for transient failures
4. Monitor query performance
5. Implement efficient data upload to Redshift

## Resources

### Documentation
- [Python Performance Tips](https://wiki.python.org/moin/PythonSpeed/PerformanceTips)
- [Psycopg2 Documentation](https://www.psycopg.org/docs/)
- [AWS Redshift Best Practices](https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html)
- [Multiprocessing Guide](https://docs.python.org/3/library/multiprocessing.html)

### Kaggle Datasets
- Google Analytics Customer Revenue Prediction
- Retail Data Analytics
- E-commerce Dataset
- Marketing Analytics Dataset

### Tools
- memory_profiler: Profile memory usage
- line_profiler: Line-by-line performance profiling
- py-spy: Sampling profiler
- psutil: System and process utilities