# Week 10 - Project Integration and Polish
## Part 3: Performance Optimization and Caching

### Learning Objectives
By the end of this session, you will be able to:
- Implement advanced caching strategies for large datasets
- Optimize database queries and data loading
- Handle real-time data updates efficiently
- Monitor and debug performance issues
- Scale applications for multiple concurrent users
- Implement progressive loading and user feedback

### Business Context
Performance directly impacts user adoption and business value:

- **User Experience**: Slow apps frustrate stakeholders and reduce usage
- **Operational Costs**: Inefficient apps consume more resources
- **Scalability**: Poor performance limits how many users can access insights
- **Reliability**: Performance issues can cause app crashes and data loss

**The Goal**: Create applications that are fast, reliable, and scalable for business use.

## Setup and Imports

In [None]:
# Performance optimization imports
import streamlit as st
import pandas as pd
import numpy as np
import time
import psutil
import threading
from datetime import datetime, timedelta
import hashlib
import pickle
import os
from functools import wraps
import plotly.express as px
import plotly.graph_objects as go

# Helper utilities
import sys
sys.path.append('/content/python-data-analysis-course')
from Utilities.colab_helper import setup_colab
from Utilities.olist_helper import load_olist_data

setup_colab()
print("✅ Performance optimization environment ready")

## 1. Performance Fundamentals

### Understanding Performance Bottlenecks

**Common Performance Issues in Data Applications**:

1. **Data Loading**: Reading large datasets repeatedly
2. **Data Processing**: Complex calculations on every interaction
3. **Visualization**: Rendering complex charts with too much data
4. **Network Requests**: API calls and database queries
5. **Memory Usage**: Loading everything into memory
6. **State Management**: Inefficient session state handling

### Performance Measurement

```python
# Performance monitoring utilities
def measure_performance(func):
    """Decorator to measure function execution time"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()  # monotonic clock, suited to measuring intervals
        result = func(*args, **kwargs)
        execution_time = time.perf_counter() - start_time
        print(f"{func.__name__} executed in {execution_time:.2f} seconds")
        return result
    return wrapper

def get_memory_usage():
    """Get current memory usage"""
    process = psutil.Process(os.getpid())
    memory_info = process.memory_info()
    return memory_info.rss / 1024 / 1024  # Convert to MB
```

### Performance Targets for Business Apps

| Metric | Target | Business Impact |
|--------|--------|----------------|
| Initial Load | < 10 seconds | User retention |
| User Interactions | < 2 seconds | User satisfaction |
| Data Refresh | < 5 seconds | Workflow efficiency |
| Memory Usage | < 1GB | Cost efficiency |
| Concurrent Users | 10-50 users | Business scalability |
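
One way to make these targets actionable is to encode them as thresholds and compare measured values against them. The sketch below is a minimal illustration, not part of the course utilities: the dictionary keys and the `check_targets` helper are hypothetical names, and the sample measurements are made up for the example.

```python
# Thresholds copied from the targets table; the dictionary keys are
# hypothetical names chosen for this sketch.
TARGETS = {
    "initial_load_s": 10.0,
    "interaction_s": 2.0,
    "data_refresh_s": 5.0,
    "memory_mb": 1024.0,
}


def check_targets(measurements, targets=TARGETS):
    """Return {metric: (value, limit)} for every measurement at or over its limit."""
    violations = {}
    for metric, value in measurements.items():
        limit = targets.get(metric)
        if limit is not None and value >= limit:
            violations[metric] = (value, limit)
    return violations


# Illustrative measurements only (not real benchmark results).
measured = {"initial_load_s": 4.2, "interaction_s": 2.6, "memory_mb": 512.0}
violations = check_targets(measured)
print(violations)
```

Here only the interaction time is flagged, because 2.6 seconds is over the 2 second target.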

## 2. Advanced Caching Strategies

### Understanding Streamlit Caching

Streamlit provides two main caching decorators:

- **`@st.cache_data`**: For data loading and processing
- **`@st.cache_resource`**: For global resources (database connections, models)
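
To build intuition for what a `ttl` setting means, here is a minimal pure-Python sketch of a time-to-live cache. It is an illustration of the idea only and is not how Streamlit implements its decorators; `ttl_cache`, the injected `clock`, and `load_orders` are hypothetical names for this example.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache results per argument tuple; recompute once an entry is older than ttl_seconds."""
    def decorator(func):
        store = {}  # maps args -> (timestamp, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now, value)
            return value

        return wrapper
    return decorator


# A fake clock makes the expiry visible without sleeping.
fake_now = [0.0]
calls = []


@ttl_cache(ttl_seconds=600, clock=lambda: fake_now[0])
def load_orders(state):
    calls.append(state)
    return f"orders for {state}"


load_orders("SP")      # miss: runs the function
load_orders("SP")      # hit: served from the cache
fake_now[0] = 601.0
load_orders("SP")      # expired: runs the function again
print(len(calls))
```

The function body runs twice across three calls: once for the first miss and once after the entry has expired.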

### Pattern 1: Hierarchical Data Caching

In [None]:
# Advanced caching patterns

# Level 1: Raw data caching (rarely changes)
@st.cache_data(ttl=3600)  # Cache for 1 hour
def load_raw_olist_data():
    """Load raw Olist datasets - cached for long periods"""
    with st.spinner("Loading raw datasets..."):
        return {
            'orders': load_olist_data('olist_orders_dataset'),
            'reviews': load_olist_data('olist_order_reviews_dataset'),
            'customers': load_olist_data('olist_customers_dataset'),
            'products': load_olist_data('olist_products_dataset')
        }

# Level 2: Processed data caching (changes with parameters)
@st.cache_data(ttl=600)  # Cache for 10 minutes
def prepare_analysis_data(date_range, customer_states):
    """Prepare data based on user filters - shorter cache"""
    raw_data = load_raw_olist_data()
    
    # Filter and process based on parameters
    start_date, end_date = date_range
    
    orders = raw_data['orders']
    orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])
    
    # Apply filters
    filtered_orders = orders[
        (orders['order_purchase_timestamp'].dt.date >= start_date) &
        (orders['order_purchase_timestamp'].dt.date <= end_date)
    ]
    
    # Merge with other datasets
    analysis_data = filtered_orders.merge(raw_data['reviews'], on='order_id', how='inner')
    analysis_data = analysis_data.merge(raw_data['customers'], on='customer_id', how='left')
    
    if customer_states:
        analysis_data = analysis_data[analysis_data['customer_state'].isin(customer_states)]
    
    return analysis_data

# Level 3: Analysis results caching (changes with all parameters)
@st.cache_data(ttl=300)  # Cache for 5 minutes
def calculate_performance_metrics(data, metric_type):
    """Calculate specific metrics - shortest cache"""
    
    if metric_type == 'satisfaction':
        return {
            'avg_score': data['review_score'].mean(),
            'satisfaction_dist': data['review_score'].value_counts().to_dict(),
            'total_reviews': len(data)
        }
    elif metric_type == 'delivery':
        # Compute into a local Series so the caller's DataFrame is not modified
        delivery_days = (pd.to_datetime(data['order_delivered_customer_date']) - 
                         pd.to_datetime(data['order_purchase_timestamp'])).dt.days
        return {
            'avg_delivery_days': delivery_days.mean(),
            'delivery_dist': delivery_days.describe().to_dict()
        }
    else:
        raise ValueError(f"Unknown metric_type: {metric_type!r}")

print("✅ Advanced caching patterns defined")
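
The benefit of layering caches is that a change in a filter only repeats the cheap steps, while the expensive load is reused. The sketch below demonstrates this with `functools.lru_cache` from the standard library instead of Streamlit, so it can run anywhere; the function names and the three-row dataset are assumptions made for the illustration.

```python
from functools import lru_cache

load_count = {"raw": 0, "filtered": 0}


@lru_cache(maxsize=1)
def load_raw():
    """Level 1: expensive load, shared by every filter combination."""
    load_count["raw"] += 1
    return tuple({"state": s, "score": n} for s, n in [("SP", 5), ("RJ", 3), ("SP", 4)])


@lru_cache(maxsize=32)
def filter_by_state(state):
    """Level 2: cheap filter, cached per parameter value."""
    load_count["filtered"] += 1
    return tuple(row for row in load_raw() if row["state"] == state)


def average_score(state):
    """Level 3: metric computed from the filtered rows."""
    rows = filter_by_state(state)
    return sum(r["score"] for r in rows) / len(rows)


print(average_score("SP"), average_score("RJ"), average_score("SP"))
print(load_count)
```

Three metric requests trigger one raw load and two filter runs; the repeated "SP" request is served entirely from the cache.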

### Pattern 2: Smart Cache Invalidation

In [None]:
# Smart cache management

def create_cache_key(*args, **kwargs):
    """Create a unique cache key from parameters"""
    key_string = str(args) + str(sorted(kwargs.items()))
    return hashlib.md5(key_string.encode()).hexdigest()

@st.cache_data(ttl=1800)  # 30 minute cache
def load_filtered_data(date_start, date_end, states, cache_key=None):
    """Load data with manual cache key for complex invalidation"""
    # Streamlit does not hash parameters whose names start with an underscore,
    # so the key is passed as a regular (hashed) parameter: changing it
    # produces a cache miss and forces a reload
    
    print(f"Loading data for period {date_start} to {date_end}")
    print(f"States: {states}")
    print(f"Cache key: {cache_key}")
    
    # Simulate data loading
    time.sleep(2)  # Simulate slow operation
    
    return pd.DataFrame({
        'data_loaded_at': [datetime.now()],
        'cache_key': [cache_key],
        'filter_applied': [f"{date_start} to {date_end}, states: {len(states)}"]
    })

def get_data_with_smart_caching(date_start, date_end, states):
    """Wrapper function that manages cache keys intelligently"""
    
    # Normalize the state list so selection order does not create duplicate cache entries
    states = tuple(sorted(states))
    
    # Create cache key based on parameters
    cache_key = create_cache_key(date_start, date_end, states)
    
    # Add time-based component (optional - for periodic refresh)
    hour_key = datetime.now().strftime('%Y%m%d%H')  # Refresh every hour
    final_cache_key = f"{cache_key}_{hour_key}"
    
    return load_filtered_data(date_start, date_end, states, cache_key=final_cache_key)

# Cache monitoring and clearing
def display_cache_info():
    """Display current cache status"""
    if st.button("Clear All Caches"):
        st.cache_data.clear()
        st.success("All caches cleared!")
        st.rerun()  # replaces the deprecated st.experimental_rerun()
    
    st.info("💡 Tip: Caches automatically expire based on TTL (Time To Live) settings")

print("✅ Smart cache invalidation patterns defined")
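
A manual cache key is only useful if equal inputs always produce the same string and the time component rolls over when expected. The following standalone sketch shows one way to check both properties; `make_cache_key` and its `now` parameter are hypothetical names introduced here so the hour bucket can be tested with a fixed time.

```python
import hashlib
import json
from datetime import date, datetime


def make_cache_key(date_start, date_end, states, now=None):
    """Build a key that is stable across argument order and rolls over every hour."""
    payload = {
        "start": date_start.isoformat(),
        "end": date_end.isoformat(),
        "states": sorted(states),  # order of selection must not matter
    }
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    hour_bucket = (now or datetime.now()).strftime("%Y%m%d%H")
    return f"{digest[:16]}_{hour_bucket}"


fixed_now = datetime(2024, 1, 15, 9, 30)
key_a = make_cache_key(date(2024, 1, 1), date(2024, 1, 31), ["SP", "RJ"], now=fixed_now)
key_b = make_cache_key(date(2024, 1, 1), date(2024, 1, 31), ["RJ", "SP"], now=fixed_now)
key_c = make_cache_key(date(2024, 1, 1), date(2024, 1, 31), ["SP", "RJ"],
                       now=datetime(2024, 1, 15, 10, 0))
print(key_a == key_b, key_a == key_c)
```

The first comparison is true because the state list is sorted before hashing; the second is false because the hour bucket changed.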

## 3. Database Integration and Query Optimization

### Efficient Database Patterns

When working with databases (like Supabase), optimization is crucial for performance.

In [None]:
# Database optimization patterns

# Pattern 1: Shared cached connection
# (st.cache_resource creates one object and reuses it; this is not a true connection pool)
@st.cache_resource
def get_database_connection():
    """Create a cached database connection - shared across users"""
    # In real app, this would be your database connection
    print("Creating database connection...")
    
    # Simulated connection object
    class MockConnection:
        def __init__(self):
            self.created_at = datetime.now()
            self.query_count = 0
        
        def execute_query(self, query):
            self.query_count += 1
            print(f"Executing query #{self.query_count}: {query[:50]}...")
            time.sleep(0.1)  # Simulate query time
            return f"Query result #{self.query_count}"
    
    return MockConnection()

# Pattern 2: Query optimization with parameters
@st.cache_data(ttl=600)
def execute_optimized_query(query, params=None, _connection=None):
    """Execute database query with caching and optimization"""
    
    if _connection is None:
        _connection = get_database_connection()
    
    # Add query optimization hints
    optimized_query = optimize_query(query, params)
    
    return _connection.execute_query(optimized_query)

def optimize_query(query, params):
    """Apply query optimization techniques"""
    
    # Example optimizations:
    optimizations = []
    
    # Add LIMIT for large result sets
    # (int() rejects non-numeric input before it is interpolated into the SQL string)
    if 'LIMIT' not in query.upper() and params and params.get('limit'):
        query += f" LIMIT {int(params['limit'])}"
        optimizations.append("Added LIMIT clause")
    
    # Add indexes hints (database specific)
    if params and params.get('use_index'):
        # This would be database-specific optimization
        optimizations.append("Added index hint")
    
    if optimizations:
        print(f"Query optimizations applied: {', '.join(optimizations)}")
    
    return query

# Pattern 3: Chunked data loading
@st.cache_data(ttl=1800)
def load_large_dataset_chunked(table_name, chunk_size=10000, max_rows=None):
    """Load large datasets in chunks to manage memory"""
    
    connection = get_database_connection()
    
    # Simulate chunked loading
    chunks = []
    offset = 0
    total_loaded = 0
    
    progress_bar = st.progress(0)
    status_text = st.empty()
    
    while True:
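        # Construct chunked query (LIMIT before OFFSET is the portable order).
        # In a real app, add an ORDER BY on a unique column: without it the
        # row order is not guaranteed and pages can overlap or skip rows.
        query = f"SELECT * FROM {table_name} LIMIT {chunk_size} OFFSET {offset}"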
        
        # Simulate chunk loading
        chunk_data = connection.execute_query(query)
        
        # Simulate chunk size (in real app, this would be actual data)
        simulated_chunk_size = min(chunk_size, (max_rows or 100000) - total_loaded)
        
        if simulated_chunk_size <= 0:
            break
        
        chunks.append(f"Chunk {len(chunks) + 1}: {simulated_chunk_size} rows")
        total_loaded += simulated_chunk_size
        offset += chunk_size
        
        # Update progress
        if max_rows:
            progress = total_loaded / max_rows
            progress_bar.progress(min(progress, 1.0))
            status_text.text(f"Loaded {total_loaded:,} of {max_rows:,} rows")
        
        # Break if chunk was smaller than expected (end of data)
        if simulated_chunk_size < chunk_size:
            break
    
    progress_bar.empty()
    status_text.empty()
    
    return {
        'chunks': chunks,
        'total_rows': total_loaded,
        'chunk_count': len(chunks)
    }

print("✅ Database optimization patterns defined")
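
For a runnable version of chunked loading, the sketch below uses the standard library's `sqlite3` with an in-memory table. Note that it swaps in keyset pagination (`WHERE id > ?` with `ORDER BY id`) rather than the OFFSET approach used in the cell above, because keyset paging gives a stable order and does not rescan skipped rows. The `orders` table, its columns, and `iter_chunks` are assumptions for the example.

```python
import sqlite3


def iter_chunks(conn, table, chunk_size=1000):
    """Yield lists of rows, paging by primary key instead of OFFSET."""
    # Table names cannot be bound as parameters, so only accept known identifiers.
    if table not in {"orders"}:
        raise ValueError(f"unexpected table name: {table!r}")
    last_id = 0
    while True:
        rows = conn.execute(
            f"SELECT id, amount FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # resume after the last row seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 2501)])

chunk_sizes = [len(chunk) for chunk in iter_chunks(conn, "orders", chunk_size=1000)]
print(chunk_sizes)
```

Because `iter_chunks` is a generator, only one chunk needs to be held in memory at a time if the caller processes each chunk as it arrives.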

## 4. Memory Management and Optimization

### Memory-Efficient Data Handling

In [None]:
# Memory optimization patterns

def optimize_dataframe_memory(df):
    """Optimize DataFrame memory usage by adjusting data types"""
    
    memory_before = df.memory_usage(deep=True).sum() / 1024 / 1024  # MB
    
    # Optimize numeric columns
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    
    # Optimize string columns
    for col in df.select_dtypes(include=['object']).columns:
        num_unique = df[col].nunique()
        num_total = len(df)
        
        # If less than 50% unique values, convert to category
        if num_unique / num_total < 0.5:
            df[col] = df[col].astype('category')
    
    memory_after = df.memory_usage(deep=True).sum() / 1024 / 1024  # MB
    reduction = (memory_before - memory_after) / memory_before * 100
    
    return df, {
        'memory_before_mb': memory_before,
        'memory_after_mb': memory_after,
        'reduction_percent': reduction
    }

# Pattern: Data sampling for large datasets
def smart_data_sampling(df, max_rows=50000, sample_method='random'):
    """Intelligently sample data for performance while maintaining insights"""
    
    if len(df) <= max_rows:
        return df, {'sampled': False, 'original_rows': len(df)}
    
    if sample_method == 'random':
        sampled_df = df.sample(n=max_rows, random_state=42)
    elif sample_method == 'stratified' and 'category_column' in df.columns:
        # Stratified sampling to maintain category proportions
        sampled_df = df.groupby('category_column', group_keys=False).apply(
            lambda x: x.sample(min(len(x), max_rows // df['category_column'].nunique()))
        )
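    else:
        # Systematic sampling: keep every n-th row (on time-sorted data this
        # preserves the time span). Round the step up so the result never
        # exceeds max_rows; a floored step can return almost the whole frame.
        step = -(-len(df) // max_rows)  # ceiling division
        sampled_df = df.iloc[::step]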
    
    return sampled_df, {
        'sampled': True,
        'original_rows': len(df),
        'sampled_rows': len(sampled_df),
        'sample_ratio': len(sampled_df) / len(df)
    }

# Pattern: Lazy data loading
class LazyDataLoader:
    """Load data only when needed"""
    
    def __init__(self, data_source):
        self.data_source = data_source
        self._data = None
        self._metadata = None
    
    @property
    def data(self):
        if self._data is None:
            print(f"Loading data from {self.data_source}...")
            # In real app, this would load from file/database
            self._data = self._load_data()
        return self._data
    
    @property
    def metadata(self):
        if self._metadata is None:
            # Load just metadata without full data
            self._metadata = self._load_metadata()
        return self._metadata
    
    def _load_data(self):
        # Simulate data loading
        time.sleep(1)
        return pd.DataFrame({
            'id': range(1000),
            'value': np.random.randn(1000),
            'category': np.random.choice(['A', 'B', 'C'], 1000)
        })
    
    def _load_metadata(self):
        # Quick metadata load
        return {
            'source': self.data_source,
            'estimated_rows': 1000,
            'last_updated': datetime.now() - timedelta(hours=2)
        }

print("✅ Memory optimization patterns defined")
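
Systematic sampling (keeping every n-th row) only respects a row cap when the step is rounded up. The sketch below shows the arithmetic on a plain Python list so it runs without pandas; `systematic_sample` is a hypothetical helper, and the same slicing idea applies to `DataFrame.iloc`.

```python
import math


def systematic_sample(items, max_items):
    """Keep every n-th item so that at most max_items remain."""
    if max_items <= 0:
        raise ValueError("max_items must be positive")
    if len(items) <= max_items:
        return list(items)
    step = math.ceil(len(items) / max_items)  # rounding up keeps the result within the cap
    return list(items[::step])


rows = list(range(75_000))
sample = systematic_sample(rows, max_items=50_000)

floor_step = len(rows) // 50_000          # 1: a floored step would keep every row
print(len(sample), len(rows[::floor_step]))
```

With 75,000 rows and a cap of 50,000, the rounded-up step of 2 yields 37,500 rows, while a floored step of 1 would return all 75,000.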

## 5. Progressive Loading and User Feedback

### Creating Responsive User Experiences

In [None]:
%%writefile progressive_loading_app.py
import streamlit as st
import pandas as pd
import numpy as np
import time
import threading
from datetime import datetime
import plotly.express as px

st.set_page_config(
    page_title="Progressive Loading Demo",
    page_icon="⚡",
    layout="wide"
)

# Progressive loading patterns
def create_loading_state():
    """Create sophisticated loading indicators"""
    
    # Multi-stage loading with progress
    stages = [
        ("Connecting to database", 1),
        ("Loading customer data", 3),
        ("Processing orders", 2),
        ("Calculating metrics", 1),
        ("Generating visualizations", 2)
    ]
    
    progress_bar = st.progress(0)
    status_text = st.empty()
    
    total_time = sum(stage[1] for stage in stages)
    elapsed_time = 0
    
    for stage_name, stage_duration in stages:
        status_text.text(f"🔄 {stage_name}...")
        
        # Simulate stage progress
        for i in range(stage_duration * 10):
            time.sleep(0.1)
            elapsed_time += 0.1
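            # Repeatedly adding 0.1 accumulates float error, so clamp to the
            # 0.0-1.0 range that st.progress accepts
            progress = min(elapsed_time / total_time, 1.0)
            progress_bar.progress(progress)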
    
    status_text.text("✅ Complete!")
    time.sleep(0.5)
    
    progress_bar.empty()
    status_text.empty()

@st.cache_data
def load_data_with_feedback(dataset_size):
    """Load data with user feedback during the process"""
    
    # Show immediate feedback
    with st.spinner(f"Generating dataset with {dataset_size:,} records..."):
        
        # Create sample data in chunks to show progress
        chunk_size = max(1, min(10000, dataset_size // 10))  # never 0, which would break range()
        chunks = []
        
        progress_bar = st.progress(0)
        status_container = st.empty()
        
        for i in range(0, dataset_size, chunk_size):
            current_chunk_size = min(chunk_size, dataset_size - i)
            
            # Generate chunk
            chunk = pd.DataFrame({
                'id': range(i, i + current_chunk_size),
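                # Start each chunk i hours after the base date so timestamps continue
                # across chunks instead of restarting ('h' replaces the deprecated 'H' alias)
                'timestamp': pd.date_range(pd.Timestamp('2023-01-01') + pd.Timedelta(hours=i), periods=current_chunk_size, freq='h'),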
                'value': np.random.randn(current_chunk_size),
                'category': np.random.choice(['A', 'B', 'C', 'D'], current_chunk_size),
                'region': np.random.choice(['North', 'South', 'East', 'West'], current_chunk_size)
            })
            
            chunks.append(chunk)
            
            # Update progress
            progress = (i + current_chunk_size) / dataset_size
            progress_bar.progress(progress)
            status_container.text(f"Generated {i + current_chunk_size:,} of {dataset_size:,} records")
            
            # Small delay to show progress
            time.sleep(0.1)
        
        # Combine chunks
        status_container.text("Combining data chunks...")
        result = pd.concat(chunks, ignore_index=True)
        
        progress_bar.empty()
        status_container.empty()
        
        return result

def create_interactive_dashboard():
    """Create dashboard with progressive disclosure"""
    
    st.title("⚡ Progressive Loading Dashboard")
    
    # Level 1: Quick overview (loads immediately)
    st.subheader("📊 Quick Overview")
    
    col1, col2, col3, col4 = st.columns(4)
    with col1:
        st.metric("Total Records", "1,234,567", "↑ 12%")
    with col2:
        st.metric("Active Users", "45,678", "↑ 8%")
    with col3:
        st.metric("Revenue", "$123,456", "↑ 15%")
    with col4:
        st.metric("Conversion Rate", "3.4%", "↓ 0.2%")
    
    # Level 2: Interactive analysis (loads on demand)
    st.subheader("🔍 Detailed Analysis")
    
    with st.expander("Load Detailed Data Analysis", expanded=False):
        dataset_size = st.selectbox(
            "Choose dataset size:",
            [1000, 10000, 50000, 100000],
            index=1
        )
        
        if st.button("Load Analysis Data", type="primary"):
            # Show loading process
            data = load_data_with_feedback(dataset_size)
            
            # Display results
            st.success(f"✅ Loaded {len(data):,} records successfully!")
            
            # Quick statistics
            col1, col2 = st.columns(2)
            
            with col1:
                st.subheader("Data Overview")
                st.dataframe(data.describe())
            
            with col2:
                st.subheader("Category Distribution")
                category_counts = data['category'].value_counts()
                fig = px.pie(values=category_counts.values, names=category_counts.index)
                st.plotly_chart(fig, use_container_width=True)
            
            # Time series (sampled for performance)
            st.subheader("Value Trends Over Time")
            
            if len(data) > 5000:
                st.info(f"📊 Showing sample of {min(5000, len(data)):,} points for performance")
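                # Sort after sampling: a line chart drawn from shuffled rows zigzags across the time axis
                plot_data = data.sample(n=min(5000, len(data)), random_state=42).sort_values('timestamp')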
            else:
                plot_data = data
            
            fig_line = px.line(plot_data, x='timestamp', y='value', color='category')
            st.plotly_chart(fig_line, use_container_width=True)
    
    # Level 3: Advanced features (loads on specific request)
    st.subheader("🚀 Advanced Features")
    
    advanced_feature = st.selectbox(
        "Choose advanced analysis:",
        ["None", "Statistical Analysis", "Correlation Matrix", "Anomaly Detection"]
    )
    
    if advanced_feature != "None":
        with st.spinner(f"Running {advanced_feature.lower()}..."):
            time.sleep(2)  # Simulate complex calculation
            st.success(f"✅ {advanced_feature} completed!")
            
            if advanced_feature == "Statistical Analysis":
                st.write("Statistical analysis results would appear here...")
            elif advanced_feature == "Correlation Matrix":
                st.write("Correlation matrix visualization would appear here...")
            elif advanced_feature == "Anomaly Detection":
                st.write("Anomaly detection results would appear here...")

def show_performance_monitoring():
    """Show real-time performance monitoring"""
    
    st.sidebar.markdown("---")
    st.sidebar.subheader("⚡ Performance Monitor")
    
    # Memory usage
    try:
        import psutil
        process = psutil.Process()
        memory_mb = process.memory_info().rss / 1024 / 1024
        st.sidebar.metric("Memory Usage", f"{memory_mb:.1f} MB")
    except ImportError:
        st.sidebar.info("Install psutil for memory monitoring")
    
    # Cache status
    if st.sidebar.button("Clear Cache"):
        st.cache_data.clear()
        st.sidebar.success("Cache cleared!")
    
    # Performance tips
    with st.sidebar.expander("Performance Tips"):
        st.write("""
        - Use smaller datasets for exploration
        - Enable caching for repeated operations
        - Sample large datasets for visualization
        - Load data progressively
        """)

def main():
    # Performance monitoring
    show_performance_monitoring()
    
    # Main dashboard
    create_interactive_dashboard()
    
    # Demo: Progressive loading
    st.markdown("---")
    st.subheader("🎯 Progressive Loading Demo")
    
    if st.button("Demo: Multi-Stage Loading"):
        create_loading_state()
        st.success("Multi-stage loading complete!")

if __name__ == "__main__":
    main()
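
`st.progress` expects a float between 0.0 and 1.0 (or an integer from 0 to 100), so a multi-stage progress bar needs a fraction that never leaves that range. A simple way to get one, sketched below without Streamlit, is to derive the fraction from the completed stage weights instead of accumulating small time increments. `stage_progress` is a hypothetical helper; the stage list mirrors the one in the demo app.

```python
def stage_progress(stages):
    """Yield (stage_name, fraction_done_after_stage) with fractions clamped to [0.0, 1.0]."""
    total = sum(weight for _, weight in stages)
    done = 0
    for name, weight in stages:
        done += weight
        # Dividing integer totals avoids the drift of repeatedly adding 0.1.
        yield name, min(done / total, 1.0)


stages = [
    ("Connecting to database", 1),
    ("Loading customer data", 3),
    ("Processing orders", 2),
    ("Calculating metrics", 1),
    ("Generating visualizations", 2),
]
fractions = [round(f, 3) for _, f in stage_progress(stages)]
print(fractions)
```

In an app, each yielded fraction could be passed to `progress_bar.progress(...)` once the corresponding stage finishes.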

## 6. Real-Time Data Updates and Monitoring

### Handling Live Data Efficiently

In [None]:
# Real-time data update patterns

import asyncio
from datetime import datetime, timedelta

class DataStreamManager:
    """Manage real-time data streams efficiently"""
    
    def __init__(self, update_interval=30):
        self.update_interval = update_interval  # seconds
        self.last_update = None
        self.data_cache = {}
        self.subscribers = []
    
    def should_update(self):
        """Check if data should be updated"""
        if self.last_update is None:
            return True
        
        time_since_update = (datetime.now() - self.last_update).total_seconds()
        return time_since_update >= self.update_interval
    
    def get_real_time_data(self, data_type):
        """Get real-time data with intelligent caching"""
        
        cache_key = f"{data_type}_{datetime.now().strftime('%Y%m%d%H%M')}"
        
        if cache_key in self.data_cache and not self.should_update():
            return self.data_cache[cache_key]
        
        # Simulate real-time data fetch
        print(f"Fetching real-time {data_type} data...")
        
        # Simulate API call or database query
        new_data = {
            'timestamp': datetime.now(),
            'data_type': data_type,
            'value': np.random.randint(100, 1000),
            'trend': np.random.choice(['up', 'down', 'stable'])
        }
        
        self.data_cache[cache_key] = new_data
        self.last_update = datetime.now()
        
        # Notify subscribers
        self.notify_subscribers(new_data)
        
        return new_data
    
    def notify_subscribers(self, data):
        """Notify all subscribers of data updates"""
        for callback in self.subscribers:
            try:
                callback(data)
            except Exception as e:
                print(f"Error notifying subscriber: {e}")
    
    def subscribe(self, callback):
        """Subscribe to data updates"""
        self.subscribers.append(callback)

# Pattern: Incremental data updates
@st.cache_data(ttl=60)  # Cache for 1 minute
def get_incremental_updates(last_update_time=None):
    """Get only new data since last update"""
    
    if last_update_time is None:
        last_update_time = datetime.now() - timedelta(hours=24)
    
    # Simulate incremental data query
    # In real app: SELECT * FROM table WHERE updated_at > last_update_time
    
    new_records = pd.DataFrame({
        'id': range(100),
        'timestamp': pd.date_range(last_update_time, periods=100, freq='1min'),  # 'T' alias is deprecated
        'value': np.random.randn(100),
        'status': np.random.choice(['active', 'inactive'], 100)
    })
    
    return new_records

# Pattern: Smart refresh strategies
def create_smart_refresh_button():
    """Create intelligent refresh controls"""
    
    col1, col2, col3 = st.columns([1, 1, 2])
    
    with col1:
        if st.button("🔄 Refresh Now"):
            st.cache_data.clear()
            st.rerun()  # replaces the deprecated st.experimental_rerun()
    
    with col2:
        auto_refresh = st.checkbox("Auto-refresh", value=False)
    
    with col3:
        if auto_refresh:
            refresh_interval = st.selectbox(
                "Refresh every:",
                [30, 60, 300, 600],  # seconds
                format_func=lambda x: f"{x//60}m {x%60}s" if x >= 60 else f"{x}s"
            )
            
            # Auto-refresh using session state
            if 'last_refresh' not in st.session_state:
                st.session_state.last_refresh = datetime.now()
            
            time_since_refresh = (datetime.now() - st.session_state.last_refresh).total_seconds()
            
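            # Streamlit reruns the script only on user interaction, so this check
            # is evaluated on the next rerun rather than on a background timer
            if time_since_refresh >= refresh_interval:
                st.session_state.last_refresh = datetime.now()
                st.rerun()  # replaces the deprecated st.experimental_rerun()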
            
            # Show countdown
            remaining = refresh_interval - time_since_refresh
            st.caption(f"Next refresh in {remaining:.0f}s")

print("✅ Real-time data patterns defined")
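
The incremental update idea (fetch only rows changed since the last poll) can be tried end to end with an in-memory SQLite table. This is a minimal sketch under assumptions: the `events` table, its `updated_at` column stored as ISO-8601 text, and the `fetch_new_rows` helper are all hypothetical and stand in for whatever the real data source provides.

```python
import sqlite3


def fetch_new_rows(conn, last_seen):
    """Return rows updated after last_seen plus the new watermark."""
    rows = conn.execute(
        "SELECT id, updated_at FROM events WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Keep the old watermark when nothing new arrived.
    new_watermark = rows[-1][1] if rows else last_seen
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2024-01-15T09:00:00"),
    (2, "2024-01-15T09:05:00"),
])

rows, watermark = fetch_new_rows(conn, "2024-01-15T08:00:00")   # first poll: both rows
conn.execute("INSERT INTO events VALUES (3, '2024-01-15T09:10:00')")
new_rows, watermark = fetch_new_rows(conn, watermark)           # second poll: only row 3
print(len(rows), [r[0] for r in new_rows], watermark)
```

In a Streamlit app the watermark could be kept in `st.session_state` between reruns so that each refresh transfers only the new rows.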

## 7. Performance Monitoring and Debugging

### Built-in Performance Monitoring

In [None]:
# Performance monitoring and debugging tools

import logging
import traceback
from contextlib import contextmanager

class PerformanceMonitor:
    """Monitor and log application performance"""
    
    def __init__(self):
        self.metrics = {
            'function_calls': {},
            'execution_times': {},
            'memory_usage': {},
            'cache_hits': 0,
            'cache_misses': 0
        }
        self.start_time = datetime.now()
    
    @contextmanager
    def measure_execution(self, operation_name):
        """Context manager to measure execution time"""
        start_time = time.time()
        start_memory = get_memory_usage()
        
        try:
            yield
        finally:
            end_time = time.time()
            end_memory = get_memory_usage()
            
            execution_time = end_time - start_time
            memory_delta = end_memory - start_memory
            
            # Record metrics
            self.record_execution(operation_name, execution_time, memory_delta)
    
    def record_execution(self, operation, execution_time, memory_delta):
        """Record execution metrics"""
        if operation not in self.metrics['function_calls']:
            self.metrics['function_calls'][operation] = 0
            self.metrics['execution_times'][operation] = []
            self.metrics['memory_usage'][operation] = []
        
        self.metrics['function_calls'][operation] += 1
        self.metrics['execution_times'][operation].append(execution_time)
        self.metrics['memory_usage'][operation].append(memory_delta)
    
    def get_performance_summary(self):
        """Get performance summary"""
        summary = {}
        
        for operation in self.metrics['function_calls']:
            times = self.metrics['execution_times'][operation]
            memory = self.metrics['memory_usage'][operation]
            
            summary[operation] = {
                'calls': self.metrics['function_calls'][operation],
                'avg_time': np.mean(times),
                'max_time': np.max(times),
                'total_time': np.sum(times),
                'avg_memory': np.mean(memory),
                'max_memory': np.max(memory)
            }
        
        return summary
    
    def display_metrics(self):
        """Display performance metrics in Streamlit"""
        st.subheader("📊 Performance Metrics")
        
        summary = self.get_performance_summary()
        
        if not summary:
            st.info("No performance data available yet")
            return
        
        # Overall metrics
        col1, col2, col3 = st.columns(3)
        
        with col1:
            total_calls = sum(data['calls'] for data in summary.values())
            st.metric("Total Function Calls", total_calls)
        
        with col2:
            total_time = sum(data['total_time'] for data in summary.values())
            st.metric("Total Execution Time", f"{total_time:.2f}s")
        
        with col3:
            cache_hit_rate = self.metrics['cache_hits'] / max(1, self.metrics['cache_hits'] + self.metrics['cache_misses'])
            st.metric("Cache Hit Rate", f"{cache_hit_rate:.1%}")
        
        # Detailed breakdown
        st.subheader("Function Performance Breakdown")
        
        performance_df = pd.DataFrame([
            {
                'Function': func,
                'Calls': data['calls'],
                'Avg Time (s)': round(data['avg_time'], 3),
                'Max Time (s)': round(data['max_time'], 3),
                'Total Time (s)': round(data['total_time'], 3),
                'Avg Memory (MB)': round(data['avg_memory'], 2)
            }
            for func, data in summary.items()
        ])
        
        st.dataframe(performance_df)
        
        # Performance visualization
        if len(performance_df) > 0:
            fig = px.bar(
                performance_df,
                x='Function',
                y='Total Time (s)',
                title="Total Execution Time by Function"
            )
            st.plotly_chart(fig, use_container_width=True)

# Global performance monitor
if 'perf_monitor' not in st.session_state:
    st.session_state.perf_monitor = PerformanceMonitor()

# Decorator for automatic performance monitoring
def monitor_performance(operation_name=None):
    """Decorator to automatically monitor function performance"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            name = operation_name or func.__name__
            
            with st.session_state.perf_monitor.measure_execution(name):
                return func(*args, **kwargs)
        
        return wrapper
    return decorator

# Example usage of performance monitoring
@monitor_performance("data_loading")
def load_monitored_data(size=1000):
    """Example function with performance monitoring"""
    time.sleep(0.5)  # Simulate slow operation
    return pd.DataFrame({
        'id': range(size),
        'value': np.random.randn(size)
    })

@monitor_performance("data_processing")
def process_monitored_data(df):
    """Example processing function with monitoring"""
    time.sleep(0.2)  # Simulate processing time
    return df.groupby(df.index // 100).agg({
        'value': ['mean', 'std', 'count']
    })

print("✅ Performance monitoring tools defined")
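
When `psutil` is unavailable, or when per-block Python allocations are more interesting than whole-process memory, the standard library offers an alternative. The sketch below pairs `time.perf_counter` with `tracemalloc`; it is a different measurement from the RSS figure used above (it tracks Python-level allocations only), and the `measure` helper is a hypothetical name.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall time and peak Python allocations for the enclosed block."""
    # Note: stopping tracemalloc ends all tracing, so this helper is not meant to be nested.
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_kb": peak_bytes / 1024}


results = {}
with measure("build_list", results):
    data = [i * 2 for i in range(100_000)]

print(sorted(results["build_list"]))
```

Tracing adds overhead, so timings taken this way are best used for relative comparisons between operations.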

### Debugging and Error Handling

In [None]:
# Debugging and error handling patterns

def create_debug_panel():
    """Create a debug panel for development"""
    
    with st.sidebar.expander("🐛 Debug Panel", expanded=False):
        st.subheader("Debug Information")
        
        # Session state inspection
        if st.checkbox("Show Session State"):
            st.json(dict(st.session_state))
        
        # Cache information
        if st.checkbox("Show Cache Info"):
            st.write("Cache functions available:")
            st.code("st.cache_data.clear() - Clear data cache")
            st.code("st.cache_resource.clear() - Clear resource cache")
        
        # Performance metrics
        if st.checkbox("Show Performance Metrics"):
            if 'perf_monitor' in st.session_state:
                st.session_state.perf_monitor.display_metrics()
        
        # Memory usage
        if st.checkbox("Show Memory Usage"):
            try:
                memory_mb = get_memory_usage()
                st.metric("Current Memory Usage", f"{memory_mb:.1f} MB")
            except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                st.error("Unable to get memory usage")
        
        # Error simulation for testing
        st.subheader("Error Testing")
        
        error_types = [
            "None",
            "ValueError",
            "KeyError",
            "Memory Error",
            "Network Error"
        ]
        
        selected_error = st.selectbox("Simulate Error:", error_types)
        
        if selected_error != "None" and st.button("Trigger Error"):
            simulate_error(selected_error)

def simulate_error(error_type):
    """Simulate different types of errors for testing"""
    try:
        if error_type == "ValueError":
            raise ValueError("Simulated value error for testing")
        elif error_type == "KeyError":
            test_dict = {'a': 1}
            _ = test_dict['nonexistent_key']
        elif error_type == "Memory Error":
            # Simulate memory error (don't actually consume memory)
            raise MemoryError("Simulated memory error")
        elif error_type == "Network Error":
            raise ConnectionError("Simulated network connection error")
    except Exception as e:
        handle_application_error(e, error_type)

def handle_application_error(error, context=""):
    """Centralized error handling with user-friendly messages"""
    
    error_type = type(error).__name__
    error_message = str(error)
    
    # Log the error (in production, use proper logging)
    print(f"ERROR [{error_type}] in {context}: {error_message}")
    print(f"Traceback: {traceback.format_exc()}")
    
    # User-friendly error messages
    if "Memory" in error_type:
        st.error("""
        🚨 **Memory Error**
        
        The application ran out of memory. Try:
        - Using a smaller dataset
        - Clearing the cache
        - Refreshing the page
        """)
        
        if st.button("Clear Cache and Refresh"):
            st.cache_data.clear()
            st.rerun()  # replaces the deprecated st.experimental_rerun() (Streamlit 1.27+)
    
    elif "Connection" in error_type or "Network" in error_type:
        st.error("""
        🌐 **Connection Error**
        
        Unable to connect to data source. Try:
        - Checking your internet connection
        - Refreshing the page
        - Contacting support if the issue persists
        """)
        
        if st.button("Retry Connection"):
            st.rerun()  # replaces the deprecated st.experimental_rerun() (Streamlit 1.27+)
    
    elif "KeyError" in error_type:
        st.error("""
        🔑 **Data Error**
        
        Missing expected data fields. This might be due to:
        - Data source changes
        - Invalid filter selections
        - Temporary data issues
        """)
    
    else:
        st.error(f"""
        ⚠️ **Application Error**
        
        An unexpected error occurred: {error_type}
        
        Please try refreshing the page or contact support.
        """)
    
    # Show technical details in expandable section
    with st.expander("Technical Details (for developers)"):
        st.code(f"Error Type: {error_type}")
        st.code(f"Error Message: {error_message}")
        st.code(f"Context: {context}")
        st.code(f"Timestamp: {datetime.now()}")

# Robust function wrapper
def make_robust(fallback_value=None, error_message="Operation failed"):
    """Decorator factory: handle errors in the wrapped function and return a fallback"""
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                # Pass the caller-supplied message along so the log says what failed
                handle_application_error(e, f"{func.__name__}: {error_message}")
                return fallback_value
        
        return wrapper
    return decorator

# Example usage
@make_robust(fallback_value=pd.DataFrame(), error_message="Failed to load data")
def robust_data_loader(source):
    """Data loader with built-in error handling"""
    if source == "error":
        raise ValueError("Invalid data source")
    return pd.DataFrame({'test': [1, 2, 3]})

print("✅ Debugging and error handling patterns defined")
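
The connection-error message above asks the user to retry by hand. For transient failures, one common complement is to retry automatically a few times before showing the error. The sketch below is a minimal version of that idea using only the standard library; `retry_with_backoff`, its parameters, and `flaky_loader` are hypothetical names for illustration, not part of Streamlit or the course utilities.

```python
import time

def retry_with_backoff(func, max_attempts=3, base_delay=0.5,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller's error handler take over
            sleep(base_delay * (2 ** attempt))

# Hypothetical flaky loader: fails twice, then succeeds
calls = {"count": 0}

def flaky_loader():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Simulated network connection error")
    return [1, 2, 3]

# Record the delays instead of sleeping so the demo runs instantly
delays = []
result = retry_with_backoff(flaky_loader, sleep=delays.append)
```

Only errors listed in `retry_on` are retried. Anything else (for example a `KeyError` from a schema change) propagates immediately, since repeating the call would not help.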

## 8. 🛠️ Performance Optimization Exercise

**Challenge**: Optimize a slow Streamlit application

### The Problem
You have a Streamlit app that:
- Takes 30+ seconds to load
- Uses 2GB+ of memory
- Crashes with large datasets
- Provides no feedback during loading

### Your Mission
Apply performance optimization techniques to make it fast and reliable.

In [None]:
%%writefile slow_app.py
# 🛠️ Performance Optimization Exercise
#
# BEFORE: Slow, inefficient app
# AFTER: Fast, optimized app
#
# Your task: Apply the optimization patterns we've learned
# Note: the %%writefile magic has to be the first line of a notebook cell.
import streamlit as st
import pandas as pd
import numpy as np
import time

# ❌ SLOW VERSION - DON'T USE IN PRODUCTION!
def slow_data_loader():
    """Inefficient data loading - loads everything every time"""
    print("Loading data... (this is slow!)")
    time.sleep(5)  # Simulate slow database
    
    # Create large dataset in memory
    large_data = pd.DataFrame({
        'id': range(1000000),  # 1 million rows
        'timestamp': pd.date_range('2020-01-01', periods=1000000, freq='1min'),
        'value': np.random.randn(1000000),
        'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
        'text_data': ['sample_text_' + str(i) for i in range(1000000)]
    })
    
    return large_data

def slow_processing(data):
    """Inefficient data processing"""
    print("Processing data... (this is also slow!)")
    time.sleep(3)
    
    # Inefficient operations
    results = []
    for category in data['category'].unique():
        subset = data[data['category'] == category]
        result = {
            'category': category,
            'mean': subset['value'].mean(),
            'std': subset['value'].std(),
            'count': len(subset)
        }
        results.append(result)
    
    return pd.DataFrame(results)

def slow_main():
    """Slow version of the app"""
    st.title("❌ Slow App - Don't Use This!")
    
    # No caching - loads data every time
    data = slow_data_loader()
    
    # No progress feedback
    processed = slow_processing(data)
    
    # Display everything at once
    st.subheader("Raw Data")
    st.dataframe(data)  # Will crash browser with 1M rows
    
    st.subheader("Processed Results")
    st.dataframe(processed)

if __name__ == "__main__":
    slow_main()

print("❌ Slow app created - now optimize it!")

In [None]:
%%writefile optimized_app.py
# ✅ OPTIMIZED VERSION - Use these patterns!
# Note: the %%writefile magic has to be the first line of a notebook cell.
import streamlit as st
import pandas as pd
import numpy as np
import time
from datetime import datetime

st.set_page_config(
    page_title="Optimized Performance App",
    page_icon="⚡",
    layout="wide"
)

# ✅ OPTIMIZATION 1: Caching with appropriate TTL
@st.cache_data(ttl=3600)  # Cache for 1 hour
def optimized_data_loader(sample_size=50000):
    """Efficient data loading with sampling and caching"""
    
    with st.spinner(f"Loading optimized dataset ({sample_size:,} records)..."):
        time.sleep(1)  # Simulate database query
        
        # Create manageable dataset
        data = pd.DataFrame({
            'id': range(sample_size),
            'timestamp': pd.date_range('2020-01-01', periods=sample_size, freq='1H'),
            'value': np.random.randn(sample_size),
            'category': np.random.choice(['A', 'B', 'C', 'D'], sample_size)
        })
        
        # ✅ OPTIMIZATION 2: Memory optimization
        data['category'] = data['category'].astype('category')
        data['value'] = data['value'].astype('float32')  # Use less memory
        
    return data

# ✅ OPTIMIZATION 3: Efficient processing with caching
@st.cache_data(ttl=1800)  # Cache for 30 minutes
def optimized_processing(data):
    """Efficient data processing using pandas operations"""
    
    with st.spinner("Processing data efficiently..."):
        # Use pandas groupby - much faster than loops
        processed = data.groupby('category')['value'].agg([
            'mean', 'std', 'count', 'min', 'max'
        ]).round(3)
        
        processed = processed.reset_index()
    
    return processed

# ✅ OPTIMIZATION 4: Progressive disclosure
def create_optimized_dashboard():
    """Create dashboard with progressive loading"""
    
    st.title("⚡ Optimized Performance App")
    
    # Quick overview (loads immediately; the figures below are illustrative placeholders, not measurements)
    st.subheader("📊 Quick Overview")
    
    col1, col2, col3, col4 = st.columns(4)
    with col1:
        st.metric("Status", "✅ Optimized", "90% faster")
    with col2:
        st.metric("Memory Usage", "~50MB", "-95%")
    with col3:
        st.metric("Load Time", "<3s", "-90%")
    with col4:
        st.metric("Cache Hit Rate", "85%", "+85%")
    
    # ✅ OPTIMIZATION 5: User controls for data size
    st.subheader("🎛️ Data Controls")
    
    sample_size = st.selectbox(
        "Choose dataset size:",
        [1000, 10000, 50000, 100000],
        index=2,
        help="Larger datasets take more time and memory"
    )
    
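    # ✅ OPTIMIZATION 6: Load data only when needed
    # st.button returns True only on the rerun triggered by the click, so the click
    # is remembered in session state. Otherwise the checkbox further down would
    # make the loaded data disappear on its own rerun.
    if st.button("Load Data", type="primary"):
        st.session_state.data_loaded = True
    
    if st.session_state.get("data_loaded", False):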
        
        # Show performance monitoring
        start_time = time.time()
        
        # Load data with caching
        data = optimized_data_loader(sample_size)
        
        load_time = time.time() - start_time
        
        st.success(f"✅ Data loaded in {load_time:.2f} seconds!")
        
        # ✅ OPTIMIZATION 7: Show sample instead of full data
        st.subheader("📋 Data Sample (First 100 rows)")
        st.dataframe(data.head(100))
        
        # Show data info
        col1, col2 = st.columns(2)
        with col1:
            st.info(f"Total records: {len(data):,}")
        with col2:
            memory_mb = data.memory_usage(deep=True).sum() / 1024 / 1024
            st.info(f"Memory usage: {memory_mb:.1f} MB")
        
        # ✅ OPTIMIZATION 8: Efficient processing
        with st.expander("📊 Analysis Results", expanded=True):
            processed = optimized_processing(data)
            st.dataframe(processed)
            
            # Simple visualization
            import plotly.express as px
            fig = px.bar(processed, x='category', y='mean', title='Average Value by Category')
            st.plotly_chart(fig, use_container_width=True)
        
        # ✅ OPTIMIZATION 9: Full data access on demand
        if st.checkbox("Show Full Dataset (Use with caution for large data)"):
            if len(data) > 10000:
                st.warning(f"This dataset has {len(data):,} rows. Display may be slow.")
                if st.button("I understand, show full data"):
                    st.dataframe(data)
            else:
                st.dataframe(data)

# ✅ OPTIMIZATION 10: Performance monitoring
def show_performance_panel():
    """Show performance monitoring panel"""
    
    st.sidebar.markdown("---")
    st.sidebar.subheader("⚡ Performance Panel")
    
    # Cache controls
    if st.sidebar.button("Clear Cache"):
        st.cache_data.clear()
        st.sidebar.success("Cache cleared!")
    
    # Memory monitoring
    try:
        import psutil
        process = psutil.Process()
        memory_mb = process.memory_info().rss / 1024 / 1024
        st.sidebar.metric("Memory Usage", f"{memory_mb:.1f} MB")
    except ImportError:
        st.sidebar.info("Install psutil for memory monitoring")
    
    # Performance tips
    with st.sidebar.expander("💡 Optimization Tips"):
        st.write("""
        **Applied Optimizations:**
        
        - ✅ Data caching with TTL
        - ✅ Memory-efficient data types
        - ✅ Progressive data loading
        - ✅ Smart sampling
        - ✅ User feedback during operations
        - ✅ Error handling
        """)

def main():
    """Optimized main application"""
    
    # Performance monitoring
    show_performance_panel()
    
    # Main dashboard
    create_optimized_dashboard()
    
    # Show comparison
    st.markdown("---")
    st.subheader("🔄 Before vs After Comparison")
    
    col1, col2 = st.columns(2)
    
    with col1:
        st.markdown("""
        **❌ Before Optimization:**
        - Load time: 30+ seconds
        - Memory usage: 2GB+
        - No user feedback
        - Browser crashes
        - No caching
        """)
    
    with col2:
        st.markdown("""
        **✅ After Optimization:**
        - Load time: <3 seconds
        - Memory usage: ~50MB
        - Progress indicators
        - Stable performance
        - Smart caching
        """)

if __name__ == "__main__":
    main()

print("✅ Optimized app created!")

## 9. Scaling for Multiple Users

### Multi-User Performance Considerations

When your app serves multiple business users, additional considerations apply:

#### Resource Sharing
```python
# ✅ Good: Shared resources
@st.cache_resource
def get_shared_connection():
    """Database connection shared across all users"""
    return create_db_connection()

# ✅ Good: User-specific data caching
@st.cache_data
def get_user_data(user_id, filters):
    """Cache data per user and filter combination"""
    return load_filtered_data(user_id, filters)
```
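
The practical difference between the two decorators is what callers get back: `st.cache_resource` hands every session the same object, while `st.cache_data` returns a fresh copy on each call. The toy analogue below illustrates only that behavior in plain Python. `shared_resource` and `copied_data` are made-up names for this sketch and are not how Streamlit implements its caches.

```python
import copy
from functools import wraps

def shared_resource(func):
    """Toy analogue of st.cache_resource: every caller gets the SAME object."""
    store = {}

    @wraps(func)
    def wrapper(*args):
        if args not in store:
            store[args] = func(*args)
        return store[args]
    return wrapper

def copied_data(func):
    """Toy analogue of st.cache_data: every caller gets an independent COPY."""
    store = {}

    @wraps(func)
    def wrapper(*args):
        if args not in store:
            store[args] = func(*args)
        return copy.deepcopy(store[args])
    return wrapper

@shared_resource
def get_connection():
    return {"open_cursors": 0}   # stand-in for a real connection object

@copied_data
def get_rows(user_id):
    return [user_id, 1, 2, 3]    # stand-in for a query result

conn_a, conn_b = get_connection(), get_connection()
conn_a["open_cursors"] += 1      # visible to everyone holding the connection

rows_a, rows_b = get_rows("u1"), get_rows("u1")
rows_a.append(99)                # does not leak into anyone else's copy
```

This is why shared resources need to be safe to use from several sessions at once, while cached data can be mutated freely by the code that receives it.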

#### Memory Management for Multiple Users
```python
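# Set a memory budget for the app
# Note: if get_memory_usage() reports process memory (as the psutil-based panel
# does), the figure covers the whole Streamlit server process, which all user
# sessions share. Treat this as a process-wide check, not a true per-user limit.
MAX_MEMORY_PER_USER = 500  # MB

def check_memory_limit():
    """Warn when memory usage exceeds the configured budget"""
    current_memory = get_memory_usage()
    
    if current_memory > MAX_MEMORY_PER_USER:
        st.warning(f"Memory usage ({current_memory:.1f}MB) exceeds the {MAX_MEMORY_PER_USER}MB limit. Consider clearing cache or using smaller datasets.")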
```
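
A single threshold only tells the user about a problem once the budget is already exceeded. One possible refinement, sketched here under the assumption that an early warning at 80% of the budget is useful, is to classify usage into three bands. `memory_status` and `warn_ratio` are hypothetical names introduced for this example.

```python
def memory_status(current_mb, limit_mb, warn_ratio=0.8):
    """Classify memory use as 'ok', 'warning' (near the limit), or 'exceeded'."""
    if limit_mb <= 0:
        raise ValueError("limit_mb must be positive")
    if current_mb > limit_mb:
        return "exceeded"
    if current_mb >= warn_ratio * limit_mb:
        return "warning"
    return "ok"

# With a 500 MB budget the warning band starts at 400 MB
statuses = [memory_status(mb, 500) for mb in (300, 450, 600)]
```

In an app, the three states could map to no message, `st.warning`, and `st.error` respectively, keeping the classification logic testable without Streamlit.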

#### Load Balancing Strategies
```python
# Distribute expensive operations
def balance_computational_load(operation_type):
    """Distribute heavy computations across time"""
    
    # Use the time of day as a rough proxy for system load
    current_hour = datetime.now().hour
    
    if operation_type == 'heavy_analysis' and 9 <= current_hour <= 17:
        st.info("⏰ Heavy analysis during business hours may be slower. Consider running during off-peak times.")
    
    # Queue non-urgent operations (queue_report_generation is a placeholder for your own job queue)
    if operation_type == 'report_generation':
        st.info("📊 Report queued for generation. You'll receive an email when ready.")
        return queue_report_generation()
```
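
`queue_report_generation()` is left undefined above. As one possible shape for it, the sketch below runs queued jobs on a single background thread using only the standard library. `start_report_worker`, the job ids, and the `results` dictionary are assumptions made for illustration. A production setup would more likely use a dedicated task queue and would also handle the email notification.

```python
import queue
import threading

def start_report_worker(job_queue, results):
    """Start a daemon thread that runs queued jobs one at a time."""
    def worker():
        while True:
            job_id, job = job_queue.get()
            try:
                results[job_id] = ("done", job())
            except Exception as exc:
                # Record the failure instead of letting the worker thread die
                results[job_id] = ("failed", repr(exc))
            finally:
                job_queue.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread

job_queue = queue.Queue()
results = {}
start_report_worker(job_queue, results)

# put() returns immediately; the worker does the heavy part in the background
job_queue.put(("monthly_report", lambda: sum(range(1_000_000))))
job_queue.put(("broken_report", lambda: 1 / 0))

job_queue.join()   # only for this demo; an app would poll `results` instead
```

Running jobs one at a time on a single worker is a deliberate choice here: it caps the extra load that report generation can place on the server, at the cost of longer waits when many reports are queued.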

## Summary and Next Steps

### What We Learned
- **Advanced Caching**: Hierarchical caching strategies for different data types and update frequencies
- **Database Optimization**: Connection pooling, query optimization, and chunked loading
- **Memory Management**: Data type optimization, smart sampling, and lazy loading
- **Progressive Loading**: User feedback, loading indicators, and progressive disclosure
- **Real-Time Updates**: Efficient handling of live data with smart refresh strategies
- **Performance Monitoring**: Built-in metrics, debugging tools, and error handling
- **Scaling Considerations**: Multi-user resource management and load balancing

### Key Performance Principles

1. **Cache Strategically**: Use appropriate TTL values for different data types
2. **Provide Feedback**: Always show progress during long operations
3. **Sample Intelligently**: Use manageable data sizes for interactive exploration
4. **Handle Errors Gracefully**: Provide user-friendly error messages and recovery options
5. **Monitor Continuously**: Track performance metrics and optimize based on real usage
6. **Scale Thoughtfully**: Consider multi-user impact when designing applications
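
To make principles 1 and 5 concrete, here is a minimal sketch of what a TTL means in practice and how a hit rate can be counted. It is a teaching toy, not how `st.cache_data(ttl=...)` is implemented. `TTLCache` and `get_or_compute` are hypothetical names, and the clock is injectable so that expiry can be demonstrated without waiting.

```python
import time

class TTLCache:
    """Tiny time-based cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self._store = {}        # key -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached value if still fresh, otherwise recompute and store it."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()
        self._store[key] = (now, value)
        return value

# Demo with a fake clock and a 60-second TTL
fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get_or_compute("daily_sales", lambda: "computed at t=0")
fake_now[0] = 30.0
second = cache.get_or_compute("daily_sales", lambda: "computed at t=30")  # still fresh
fake_now[0] = 61.0
third = cache.get_or_compute("daily_sales", lambda: "computed at t=61")   # expired

hit_rate = cache.hits / (cache.hits + cache.misses)
```

A short TTL keeps data fresh but lowers the hit rate, and a long TTL does the opposite. Tracking hits and misses, as above, gives a measurable basis for tuning that trade-off.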

### Performance Optimization Checklist

**Data Loading**:
- [ ] Implement appropriate caching with TTL
- [ ] Use chunked loading for large datasets
- [ ] Provide loading indicators and progress bars
- [ ] Optimize data types for memory efficiency

**User Experience**:
- [ ] Show immediate feedback for user actions
- [ ] Implement progressive disclosure of information
- [ ] Provide controls for dataset size and complexity
- [ ] Handle errors with user-friendly messages

**Monitoring & Debugging**:
- [ ] Add performance monitoring to track metrics
- [ ] Include debug panels for development
- [ ] Implement robust error handling
- [ ] Monitor memory usage and cache performance

**Scalability**:
- [ ] Test with multiple concurrent users
- [ ] Implement resource sharing where appropriate
- [ ] Consider load balancing for heavy operations
- [ ] Set and monitor resource limits

---

### 🎯 Final Week 10 Wednesday Deliverable

You now have a complete toolkit for creating high-performance, business-ready Streamlit applications:

1. **Part 1**: Convert analysis notebooks to structured applications
2. **Part 2**: Design user experiences for business stakeholders
3. **Part 3**: Optimize performance for production deployment

**Next Session**: Thursday's content will focus on deployment, presentation preparation, and final project refinement.

Your applications are now ready for business use! 🚀