# WEEK 8: ADVANCED FILTERING IN PYTHON - PART 3
## Topic: Performance Optimization for Large Datasets
## Business Case: Scaling to Production-Size E-Commerce Data

---

### LEARNING OBJECTIVES:
1. Compare performance of different filtering approaches
2. Use vectorized operations for speed
3. Apply memory-efficient filtering techniques
4. Leverage pandas optimization features
5. Understand query optimization strategies
6. Benchmark and profile filtering operations

### BUSINESS CONTEXT:
When working with production datasets (millions of rows), performance matters! Learn to optimize your filtering code for speed and memory efficiency - essential for real-world analytics.

### PERFORMANCE HIERARCHY (Fastest to Slowest):
1. ✅ Vectorized operations (NumPy/pandas)
2. ✅ `.query()` method (often comparable to boolean indexing; benchmark on your own data)
3. ✅ Boolean indexing with single mask
4. ⚠️ `.apply()` with lambda
5. ❌ Python loops (AVOID!)
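
To make the two ends of this hierarchy concrete, here is a minimal sketch that applies the same threshold filter once with a Python loop and once with a vectorized comparison. The column name `value`, the toy data, and the threshold of 200 are made up for illustration. Timings vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical toy data: 50,000 order values (names are illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'value': rng.uniform(10, 1000, 50_000)})

# Slow: visit every value in a Python-level loop
start = time.perf_counter()
loop_rows = [i for i, v in enumerate(toy['value']) if v > 200]
loop_result = toy.iloc[loop_rows]
loop_ms = (time.perf_counter() - start) * 1000

# Fast: one vectorized comparison that produces a boolean mask
start = time.perf_counter()
vec_result = toy[toy['value'] > 200]
vec_ms = (time.perf_counter() - start) * 1000

print(f"Loop:       {loop_ms:.2f} ms")
print(f"Vectorized: {vec_ms:.2f} ms")
print(f"Same rows selected: {loop_result.equals(vec_result)}")
```

Both approaches select the same rows; only the amount of work done in the Python interpreter differs.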

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import time
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print('Libraries imported successfully!')
print(f'Pandas version: {pd.__version__}')
print(f'NumPy version: {np.__version__}')

## Section 1: Create Large Sample Dataset

We'll simulate a large e-commerce dataset to test performance optimizations.

In [None]:
# Generate large sample dataset (100,000 records)
np.random.seed(42)
n_rows = 100000

print(f"Generating dataset with {n_rows:,} rows...")

large_dataset = pd.DataFrame({
    'order_id': [f'order_{i:06d}' for i in range(n_rows)],
    'customer_id': np.random.choice([f'cust_{i:05d}' for i in range(10000)], n_rows),
    'customer_state': np.random.choice(
        ['SP', 'RJ', 'MG', 'BA', 'PE', 'RS', 'PR', 'SC', 'CE', 'GO'], 
        n_rows, 
        p=[0.35, 0.2, 0.15, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.02]
    ),
    'order_value': np.random.uniform(10, 1000, n_rows).round(2),
    'freight_value': np.random.uniform(5, 100, n_rows).round(2),
    'review_score': np.random.choice(
        [1.0, 2.0, 3.0, 4.0, 5.0, np.nan], 
        n_rows, 
        p=[0.05, 0.1, 0.15, 0.3, 0.35, 0.05]
    ),
    'order_status': np.random.choice(
        ['delivered', 'shipped', 'canceled', 'processing'], 
        n_rows, 
        p=[0.75, 0.15, 0.05, 0.05]
    ),
    'payment_type': np.random.choice(
        ['credit_card', 'boleto', 'voucher', 'debit_card'], 
        n_rows, 
        p=[0.6, 0.25, 0.1, 0.05]
    ),
    'payment_installments': np.random.choice([1, 2, 3, 4, 6, 10, 12], n_rows)
})

# Add calculated columns
large_dataset['total_value'] = large_dataset['order_value'] + large_dataset['freight_value']
large_dataset['order_date'] = pd.date_range('2017-01-01', periods=n_rows, freq='5min')

print(f"\n✅ Dataset created: {len(large_dataset):,} rows")
print(f"Memory usage: {large_dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nFirst few rows:")
print(large_dataset.head())

In [None]:
# Check dataset info
print("Dataset Information:")
print(large_dataset.info())
print("\nSummary Statistics:")
print(large_dataset.describe())

## Section 2: Performance Comparison - Filtering Methods

Let's benchmark different filtering approaches on our large dataset.

In [None]:
# Create benchmark utility function
def benchmark_filter(df, method_name, filter_func, runs=5):
    """
    Benchmark a filtering function.
    
    Parameters:
    - df: DataFrame to filter
    - method_name: Name of the method for reporting
    - filter_func: Function that takes df and returns filtered df
    - runs: Number of times to run for averaging
    """
    times = []
    result_size = 0
    
    for i in range(runs):
        start_time = time.perf_counter()
        result = filter_func(df)
        end_time = time.perf_counter()
        times.append((end_time - start_time) * 1000)  # Convert to milliseconds
        result_size = len(result)
    
    avg_time = np.mean(times)
    std_time = np.std(times)
    
    return {
        'Method': method_name,
        'Avg Time (ms)': round(avg_time, 2),
        'Std Dev (ms)': round(std_time, 2),
        'Result Rows': result_size
    }

print("Benchmark function ready!")
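
As a cross-check on a hand-written timer like the one above, the standard library's `timeit` module can run the same kind of measurement. The sketch below is one possible setup, not the notebook's method: the `sample` frame and the `high_value` filter are hypothetical stand-ins for `large_dataset` and the filters defined later.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in data; swap in large_dataset when running in the notebook
rng = np.random.default_rng(1)
sample = pd.DataFrame({'total_value': rng.uniform(15, 1100, 20_000)})

def high_value(df):
    """Example filter to time: orders above R$ 200."""
    return df[df['total_value'] > 200]

# Five independent timings, each covering ten calls of the filter
calls_per_timing = 10
timings = timeit.repeat(lambda: high_value(sample), repeat=5, number=calls_per_timing)
per_call_ms = [t / calls_per_timing * 1000 for t in timings]

# The minimum is usually less affected by background noise than the mean
print(f"best: {min(per_call_ms):.3f} ms, mean: {np.mean(per_call_ms):.3f} ms")
```

Reporting the best time alongside the mean helps separate the cost of the filter itself from interference by other processes.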

### Test Case: High-Value Orders from Major States with Good Reviews

Filter criteria:
- Total value > R$ 200
- Customer state in ['SP', 'RJ', 'MG']
- Review score >= 4
- Order status = 'delivered'

In [None]:
# Define filtering methods
benchmark_results = []

# Method 1: Boolean indexing (standard approach)
def method1_boolean_indexing(df):
    return df[
        (df['total_value'] > 200) &
        (df['customer_state'].isin(['SP', 'RJ', 'MG'])) &
        (df['review_score'] >= 4) &
        (df['order_status'] == 'delivered')
    ]

# Method 2: .query() method
def method2_query(df):
    return df.query(
        "total_value > 200 and "
        "customer_state in ['SP', 'RJ', 'MG'] and "
        "review_score >= 4 and "
        "order_status == 'delivered'"
    )

# Method 3: Multiple sequential filters (INEFFICIENT)
def method3_sequential(df):
    temp1 = df[df['total_value'] > 200]
    temp2 = temp1[temp1['customer_state'].isin(['SP', 'RJ', 'MG'])]
    temp3 = temp2[temp2['review_score'] >= 4]
    return temp3[temp3['order_status'] == 'delivered']

# Method 4: Using .loc with boolean mask
def method4_loc(df):
    mask = (
        (df['total_value'] > 200) &
        (df['customer_state'].isin(['SP', 'RJ', 'MG'])) &
        (df['review_score'] >= 4) &
        (df['order_status'] == 'delivered')
    )
    return df.loc[mask]

# Method 5: Filter in stages (optimized order)
def method5_staged(df):
    # Filter most selective first. Expected share of rows kept, from the
    # generation probabilities: review ~65%, states ~70%, delivered ~75%, value ~86%
    df_reviews = df[df['review_score'] >= 4]
    df_states = df_reviews[df_reviews['customer_state'].isin(['SP', 'RJ', 'MG'])]
    df_delivered = df_states[df_states['order_status'] == 'delivered']
    return df_delivered[df_delivered['total_value'] > 200]

# Run all benchmarks
print("Running benchmarks... (this may take a moment)\n")
benchmark_results.append(benchmark_filter(large_dataset, 'Boolean Indexing', method1_boolean_indexing))
benchmark_results.append(benchmark_filter(large_dataset, '.query() Method', method2_query))
benchmark_results.append(benchmark_filter(large_dataset, 'Sequential Filters', method3_sequential))
benchmark_results.append(benchmark_filter(large_dataset, '.loc with Mask', method4_loc))
benchmark_results.append(benchmark_filter(large_dataset, 'Staged Filtering', method5_staged))

# Display results
results_df = pd.DataFrame(benchmark_results).sort_values('Avg Time (ms)')
results_df['Relative Speed'] = (results_df['Avg Time (ms)'].min() / results_df['Avg Time (ms)']).round(2)

print("\n" + "="*80)
print("PERFORMANCE COMPARISON - FILTERING METHODS")
print("="*80)
print(results_df.to_string(index=False))
print("\n" + "="*80)
print(f"✅ Fastest Method: {results_df.iloc[0]['Method']}")
print(f"⏱️  Time: {results_df.iloc[0]['Avg Time (ms)']} ms")
print(f"📊 Results: {results_df.iloc[0]['Result Rows']:,} rows")

## Section 3: Memory-Efficient Filtering

Reducing memory usage is crucial for large datasets.

In [None]:
# Compare memory usage of different approaches

# Memory-inefficient: Creating multiple intermediate dataframes
def memory_check_inefficient(df):
    step1 = df[df['total_value'] > 200]  # Creates copy
    step2 = step1[step1['customer_state'].isin(['SP', 'RJ'])]  # Creates another copy
    step3 = step2[step2['review_score'] >= 4]  # Creates another copy
    return step3

# Memory-efficient: Single boolean mask
def memory_check_efficient(df):
    mask = (
        (df['total_value'] > 200) &
        (df['customer_state'].isin(['SP', 'RJ'])) &
        (df['review_score'] >= 4)
    )
    return df[mask]  # Creates one copy

print("Memory Usage Analysis:")
print("="*60)
print(f"Original dataset: {large_dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nInefficient approach:")
print("  - Creates multiple intermediate DataFrames")
print("  - Each intermediate copy uses memory")
print("  - Memory spikes during execution")
print("\nEfficient approach:")
print("  - Creates single boolean mask (very small)")
print("  - Only one final DataFrame copy")
print("  - Minimal memory overhead")
print("\n✅ Best Practice: Use single combined boolean mask")
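
The cell above describes the two approaches without measuring them. One way to check the claim is the standard library's `tracemalloc`, which records the peak memory allocated while a function runs. The sketch below is an assumption-laden illustration: `sample` is a hypothetical stand-in for `large_dataset`, and `peak_mb` is a helper invented here. The measured peaks depend on the data, the selectivity of each filter, and the pandas version, so treat the printed numbers as something to inspect rather than a guaranteed ranking.

```python
import tracemalloc

import numpy as np
import pandas as pd

# Hypothetical stand-in data with the same column names as large_dataset
rng = np.random.default_rng(2)
n = 50_000
sample = pd.DataFrame({
    'total_value': rng.uniform(15, 1100, n),
    'customer_state': rng.choice(['SP', 'RJ', 'MG', 'BA'], n),
    'review_score': rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], n),
})

def staged(df):
    # Three filters, each materializing an intermediate DataFrame
    step1 = df[df['total_value'] > 200]
    step2 = step1[step1['customer_state'].isin(['SP', 'RJ'])]
    return step2[step2['review_score'] >= 4]

def single_mask(df):
    # One combined boolean mask, one final DataFrame
    mask = (
        (df['total_value'] > 200)
        & df['customer_state'].isin(['SP', 'RJ'])
        & (df['review_score'] >= 4)
    )
    return df[mask]

def peak_mb(func, df):
    """Run func(df) and return (result, peak MB traced by tracemalloc)."""
    tracemalloc.start()
    result = func(df)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak / 1024**2

staged_result, staged_peak = peak_mb(staged, sample)
mask_result, mask_peak = peak_mb(single_mask, sample)

print(f"Staged filters: peak {staged_peak:.2f} MB")
print(f"Single mask:    peak {mask_peak:.2f} MB")
print(f"Identical results: {staged_result.equals(mask_result)}")
```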

## Section 4: Data Type Optimization

Optimizing data types can significantly reduce memory usage.

In [None]:
# Check current memory usage
print("ORIGINAL DATA TYPES AND MEMORY USAGE")
print("="*60)
print(large_dataset.dtypes)
print(f"\nTotal memory: {large_dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Create optimized version
optimized_dataset = large_dataset.copy()

# Convert strings to categories (huge memory savings!)
optimized_dataset['customer_state'] = optimized_dataset['customer_state'].astype('category')
optimized_dataset['order_status'] = optimized_dataset['order_status'].astype('category')
optimized_dataset['payment_type'] = optimized_dataset['payment_type'].astype('category')

# Convert float64 to float32 where precision isn't critical
optimized_dataset['review_score'] = optimized_dataset['review_score'].astype('float32')
optimized_dataset['order_value'] = optimized_dataset['order_value'].astype('float32')
optimized_dataset['freight_value'] = optimized_dataset['freight_value'].astype('float32')
optimized_dataset['total_value'] = optimized_dataset['total_value'].astype('float32')

# Downcast int64 to the smallest integer type that fits (installments max out at 12, so int8 is enough)
optimized_dataset['payment_installments'] = optimized_dataset['payment_installments'].astype('int8')

print("\n\nOPTIMIZED DATA TYPES AND MEMORY USAGE")
print("="*60)
print(optimized_dataset.dtypes)
print(f"\nTotal memory: {optimized_dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Calculate savings
original_memory = large_dataset.memory_usage(deep=True).sum() / 1024**2
optimized_memory = optimized_dataset.memory_usage(deep=True).sum() / 1024**2
savings = (1 - optimized_memory / original_memory) * 100

print("\n" + "="*60)
print(f"💾 Memory saved: {savings:.1f}%")
print(f"✅ Reduction: {original_memory - optimized_memory:.2f} MB")
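
The conversions above are written column by column. For wider tables, the same idea can be sketched as a small helper. `downcast_frame` and its `max_unique_ratio` parameter are hypothetical names introduced here, not part of pandas. The 0.5 cut-off is an arbitrary assumption for deciding when a string column is repetitive enough to benefit from `category`. Float columns are deliberately left alone, since moving to `float32` trades precision for memory and deserves a per-column decision.

```python
import numpy as np
import pandas as pd

def downcast_frame(df, max_unique_ratio=0.5):
    """Return a copy with smaller dtypes (hypothetical helper, not a pandas API)."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # Let pandas pick the smallest integer type that holds every value
            out[col] = pd.to_numeric(s, downcast='integer')
        elif pd.api.types.is_string_dtype(s) and len(s) > 0:
            # Only repetitive (low-cardinality) string columns gain from category
            if s.nunique() / len(s) < max_unique_ratio:
                out[col] = s.astype('category')
    return out

# Illustrative data: one small-integer column, one repetitive and one unique string column
demo = pd.DataFrame({
    'payment_installments': np.array([1, 2, 3, 12] * 250, dtype='int64'),
    'order_status': ['delivered', 'shipped', 'canceled', 'processing'] * 250,
    'order_id': [f'order_{i:06d}' for i in range(1000)],
})
small = downcast_frame(demo)

print(small.dtypes)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"Memory: {before:,} bytes -> {after:,} bytes")
```

Note that `order_id` stays a plain string column: every value is unique, so a categorical encoding would add overhead instead of saving memory.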

## Section 5: Vectorized Operations vs Apply

Demonstrating why vectorized operations are dramatically faster.

In [None]:
# Create subset for demonstration (apply is VERY slow on large data)
demo_data = large_dataset.head(10000).copy()

# Task: Categorize orders by value
# BAD: Using .apply() with lambda
start = time.perf_counter()
demo_data['category_apply'] = demo_data['total_value'].apply(
    lambda x: 'High' if x > 300 else ('Medium' if x > 150 else 'Low')
)
apply_time = (time.perf_counter() - start) * 1000

# GOOD: Using vectorized operations (pd.cut or np.where)
start = time.perf_counter()
demo_data['category_vectorized'] = pd.cut(
    demo_data['total_value'],
    bins=[0, 150, 300, float('inf')],
    labels=['Low', 'Medium', 'High']
)
vectorized_time = (time.perf_counter() - start) * 1000

# Alternative: Using np.select (also fast)
start = time.perf_counter()
conditions = [
    demo_data['total_value'] > 300,
    demo_data['total_value'] > 150,
    demo_data['total_value'] <= 150
]
choices = ['High', 'Medium', 'Low']
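# Pass a string default: the implicit default of 0 has no common dtype with
# string choices and can raise a TypeError on recent NumPy versions
demo_data['category_select'] = np.select(conditions, choices, default='Unknown')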
select_time = (time.perf_counter() - start) * 1000

print("VECTORIZATION vs APPLY PERFORMANCE")
print("="*60)
print(f"Dataset size: {len(demo_data):,} rows")
print("\nTask: Categorize orders by total value\n")
print(f".apply() with lambda:  {apply_time:.2f} ms")
print(f"pd.cut() vectorized:   {vectorized_time:.2f} ms  ({apply_time/vectorized_time:.1f}x FASTER)")
print(f"np.select() vectorized: {select_time:.2f} ms  ({apply_time/select_time:.1f}x FASTER)")
print("\n" + "="*60)
print("⚠️  WARNING: .apply() can be 10-100x slower than vectorized operations!")
print("✅ ALWAYS try to vectorize your operations instead of using .apply()")
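
The comment in the cell above also mentions `np.where`, which is not demonstrated there. A minimal sketch of the same three-way categorization with nested `np.where` calls follows. The `values` Series is hypothetical stand-in data; the thresholds of 150 and 300 are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for demo_data['total_value']
rng = np.random.default_rng(3)
values = pd.Series(rng.uniform(15, 1100, 10_000), name='total_value')

# Nested np.where: the outer call handles 'High', the inner call splits the rest
labels_where = np.where(
    values > 300, 'High',
    np.where(values > 150, 'Medium', 'Low')
)

# Reference result from the slower .apply() version, for comparison only
labels_apply = values.apply(
    lambda x: 'High' if x > 300 else ('Medium' if x > 150 else 'Low')
)

print(f"Labels match: {labels_where.tolist() == labels_apply.tolist()}")
```

Nested `np.where` stays readable for two or three categories; beyond that, `np.select` or `pd.cut` usually scales better.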

## Section 6: Filter Order Optimization

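When you filter in stages, the order of the filters affects performance, because each stage only scans the rows that survived the previous one. With a single combined mask, every condition is evaluated on the full dataset, so the order matters far less.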

In [None]:
# Strategy: Filter most selective conditions FIRST

# Check selectivity of each condition
print("FILTER SELECTIVITY ANALYSIS")
print("="*60)
total_rows = len(large_dataset)

filters = [
    ('total_value > 500', large_dataset['total_value'] > 500),
    ('review_score >= 4', large_dataset['review_score'] >= 4),
    ('customer_state == SP', large_dataset['customer_state'] == 'SP'),
    ('order_status == delivered', large_dataset['order_status'] == 'delivered')
]

selectivity = []
for name, condition in filters:
    rows_remaining = condition.sum()
    selectivity_pct = (rows_remaining / total_rows) * 100
    selectivity.append({
        'Filter': name,
        'Rows Remaining': f"{rows_remaining:,}",
        'Selectivity %': f"{selectivity_pct:.1f}%"
    })

selectivity_df = pd.DataFrame(selectivity)
print(selectivity_df.to_string(index=False))
print("\n" + "="*60)
print("💡 TIP: Apply most selective filters (lowest %) first!")
print("   This reduces the data size early in the pipeline.")
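
The tip above can be sketched as code: measure how many rows each condition keeps, sort the conditions by that fraction, and apply them in stages. Everything here is illustrative. The `sample` data, the `conditions` dictionary, and the probabilities are assumptions chosen so the selectivities differ clearly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with deliberately different selectivities
rng = np.random.default_rng(4)
n = 20_000
sample = pd.DataFrame({
    'total_value': rng.uniform(15, 1100, n),
    'customer_state': rng.choice(['SP', 'RJ', 'MG'], n, p=[0.5, 0.3, 0.2]),
    'order_status': rng.choice(['delivered', 'canceled'], n, p=[0.9, 0.1]),
})

# Each condition is a function so it can be re-evaluated on a shrinking frame
conditions = {
    'total_value > 500': lambda d: d['total_value'] > 500,
    'customer_state == MG': lambda d: d['customer_state'] == 'MG',
    'order_status == delivered': lambda d: d['order_status'] == 'delivered',
}

# Fraction of rows each condition keeps on the full data (lower = more selective)
keep_fraction = {name: cond(sample).mean() for name, cond in conditions.items()}
ordered = sorted(conditions, key=keep_fraction.get)

# Apply the filters in stages, most selective first
staged = sample
for name in ordered:
    staged = staged[conditions[name](staged)]

# Reference: one combined mask evaluated on the full data
mask = pd.Series(True, index=sample.index)
for cond in conditions.values():
    mask &= cond(sample)
combined = sample[mask]

print("Filter order:", ordered)
print(f"Rows kept: {len(staged):,} (matches combined mask: {staged.equals(combined)})")
```

Computing the selectivities costs one full pass per condition, so this ordering step only pays off when the same pipeline runs repeatedly on similar data.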

## Section 7: Production-Ready Customer Segmentation Pipeline

Putting it all together: Optimized, fast, memory-efficient customer analysis.

In [None]:
def optimized_customer_segmentation(df, value_threshold=200, review_threshold=4):
    """
    High-performance customer segmentation using best practices.
    
    Returns: Dictionary of customer segments
    """
    # Warn if the dataset has not been dtype-optimized (see Section 4)
    if df['customer_state'].dtype.name != 'category':
        print("⚠️  Warning: Dataset not optimized. Consider optimizing dtypes first.")
    
    # Segment 1: Champions (high value + high satisfaction)
    champions = df.query(
        'total_value >= @value_threshold and '
        'review_score >= @review_threshold and '
        'order_status == "delivered"'
    )
    
    # Segment 2: At Risk (high value but low satisfaction)
    at_risk = df.query(
        'total_value >= @value_threshold and '
        'review_score < 3 and '
        'review_score.notna() and '
        'order_status == "delivered"'
    )
    
    # Segment 3: Potential (medium value, good satisfaction)
    potential = df.query(
        'total_value >= 100 and total_value < @value_threshold and '
        'review_score >= @review_threshold and '
        'order_status == "delivered"'
    )
    
    # Segment 4: Lost (canceled orders)
    lost = df.query('order_status == "canceled"')
    
    return {
        'champions': champions,
        'at_risk': at_risk,
        'potential': potential,
        'lost': lost
    }

# Run optimized segmentation
print("Running optimized customer segmentation...\n")
start = time.perf_counter()
segments = optimized_customer_segmentation(optimized_dataset)
execution_time = (time.perf_counter() - start) * 1000

# Generate summary report
print("\n" + "="*80)
print("CUSTOMER SEGMENTATION DASHBOARD")
print("="*80)
print(f"\n⏱️  Execution time: {execution_time:.2f} ms")
print(f"📊 Total orders analyzed: {len(optimized_dataset):,}\n")

summary_data = []
for segment_name, segment_df in segments.items():
    summary_data.append({
        'Segment': segment_name.title(),
        'Orders': len(segment_df),
        'Customers': segment_df['customer_id'].nunique(),
        'Total Revenue': f"R$ {segment_df['total_value'].sum():,.2f}",
        'Avg Order Value': f"R$ {segment_df['total_value'].mean():.2f}" if len(segment_df) > 0 else 'N/A',
        'Avg Review': f"{segment_df['review_score'].mean():.2f}" if segment_df['review_score'].notna().any() else 'N/A'
    })

summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))
print("\n" + "="*80)
print("\n✅ Segmentation complete and optimized for production use!")
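
The function above runs four separate `.query()` calls, each scanning the full DataFrame. An alternative worth benchmarking is to label every order in one pass with `np.select` and then aggregate with `groupby`. The sketch below assumes the same segment rules as above with the default thresholds (200 and 4). The `orders` data, the `segment` column, and the `'other'` label for unmatched orders are hypothetical additions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the columns the segment rules need
rng = np.random.default_rng(5)
n = 10_000
orders = pd.DataFrame({
    'total_value': rng.uniform(15, 1100, n),
    'review_score': rng.choice([1.0, 2.0, 3.0, 4.0, 5.0, np.nan], n),
    'order_status': rng.choice(['delivered', 'shipped', 'canceled'], n),
})

delivered = orders['order_status'] == 'delivered'
high_value = orders['total_value'] >= 200
mid_value = (orders['total_value'] >= 100) & (orders['total_value'] < 200)
happy = orders['review_score'] >= 4
unhappy = orders['review_score'] < 3  # NaN compares as False, so missing reviews are excluded

# One pass: the first matching condition decides the label
orders['segment'] = np.select(
    [
        delivered & high_value & happy,
        delivered & high_value & unhappy,
        delivered & mid_value & happy,
        orders['order_status'] == 'canceled',
    ],
    ['champions', 'at_risk', 'potential', 'lost'],
    default='other',
)

summary = orders.groupby('segment')['total_value'].agg(['count', 'sum', 'mean'])
print(summary.round(2))
```

Because the four rules are mutually exclusive, the labels reproduce the same groups as the separate queries, with the remaining orders collected under `'other'`.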

## Section 8: Performance Best Practices Summary

### PERFORMANCE HIERARCHY (Fast → Slow):

1. **🚀 Vectorized Operations** (FASTEST)
   - NumPy/pandas built-in functions
   - Example: `df['col'] > threshold`
   - **10-100x faster than loops**

2. **⚡ `.query()` Method**
   - Good for complex multi-condition filters
   - Readable SQL-like syntax
   - Often comparable to boolean indexing

3. **✅ Boolean Indexing with Single Mask**
   - Combine conditions in one expression
   - Use `.loc[]` for explicit indexing
   - Memory efficient

4. **⚠️ `.apply()` with Lambda** (SLOWER)
   - Avoid when possible
   - Use only when vectorization is impossible
   - Often 10-100x slower than vectorized operations

5. **❌ Python Loops** (SLOWEST - AVOID!)
   - Never iterate row-by-row
   - Always find vectorized alternative
   - **100x+ slower than vectorized**

### OPTIMIZATION CHECKLIST:

#### Data Type Optimization:
✅ Convert low-cardinality string columns to `category` dtype
✅ Use `float32` instead of `float64` where appropriate
✅ Use smallest integer type that fits your data
✅ Can save 50-80% memory

#### Filtering Optimization:
✅ Filter early (reduce data size ASAP)
✅ Create single boolean mask (not multiple)
✅ When filtering in stages, apply the most selective filters first
✅ Use `.query()` for complex conditions

#### Operation Optimization:
✅ Use vectorized operations always
✅ Avoid `.apply()` when possible
✅ Never use Python loops on large data
✅ Use `pd.cut()`, `np.select()`, `np.where()` for categorization

#### Memory Management:
✅ Avoid unnecessary `.copy()` operations
✅ Delete intermediate DataFrames you no longer need
✅ Consider chunking for very large datasets
✅ Monitor memory usage with `.memory_usage(deep=True)`
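
The chunking item in the checklist above is not demonstrated elsewhere in this notebook. A minimal sketch with `pd.read_csv(chunksize=...)` follows. To stay self-contained it reads from an in-memory CSV built on the spot; in practice the first argument would be a file path. The data, the chunk size of 1,000, and the R$ 500 threshold are illustrative assumptions.

```python
import io

import numpy as np
import pandas as pd

# Build a small in-memory CSV so the sketch runs without any files
rng = np.random.default_rng(6)
source = pd.DataFrame({
    'order_id': np.arange(5_000),
    'total_value': rng.uniform(15, 1100, 5_000).round(2),
})
csv_buffer = io.StringIO(source.to_csv(index=False))

# Read 1,000 rows at a time and keep only the rows that pass the filter,
# so the full dataset never has to be in memory at once
filtered_parts = []
for chunk in pd.read_csv(csv_buffer, chunksize=1_000):
    filtered_parts.append(chunk[chunk['total_value'] > 500])

high_value = pd.concat(filtered_parts, ignore_index=True)
print(f"Kept {len(high_value):,} of {len(source):,} rows")
```

Peak memory is then bounded by one chunk plus the filtered rows collected so far, rather than by the size of the whole file.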

#### Development Workflow:
✅ Always benchmark different approaches
✅ Profile code to find bottlenecks
✅ Test with production-size data
✅ Document performance considerations
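
For the profiling item in the workflow checklist above, the standard library's `cProfile` is one readily available option. The sketch below profiles a deliberately slow `.apply()` call and prints the five entries with the highest cumulative time. The `sample` data and the `slow_categorize` function are hypothetical examples.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Hypothetical stand-in data
rng = np.random.default_rng(7)
sample = pd.DataFrame({'total_value': rng.uniform(15, 1100, 20_000)})

def slow_categorize(df):
    """Deliberately slow: calls a Python lambda once per row."""
    return df['total_value'].apply(lambda x: 'High' if x > 300 else 'Low')

# Record every function call made while the code under test runs
profiler = cProfile.Profile()
profiler.enable()
slow_categorize(sample)
profiler.disable()

# Print the five entries with the highest cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(5)
print(report.getvalue())
```

A large call count on the lambda line of the report is the typical signature of row-by-row work that could be vectorized.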

## Key Takeaways

### Performance:
1. **Vectorized operations are 10-100x faster** than loops/apply
2. `.query()` provides good balance of speed and readability
3. Filter early, and apply the most selective conditions first when filtering in stages
4. Single boolean mask is more efficient than multiple filters

### Memory:
1. Data type optimization can save **50-80% memory**
2. Category dtype pays off for low-cardinality string columns (few distinct values, many rows)
3. Avoid creating unnecessary intermediate DataFrames
4. Single boolean mask uses minimal memory

### Production Tips:
1. Always benchmark with production-size data
2. Profile to identify actual bottlenecks (don't guess!)
3. Document performance characteristics
4. Set up monitoring for long-running processes

---

## Week 8 Python Series Complete!

You've now mastered:
- **Part 1:** Complex boolean filtering (`&`, `|`, `~`, `.isin()`)
- **Part 2:** SQL-like filtering with `.query()`
- **Part 3:** Performance optimization for production datasets

### What's Next?
Practice these techniques on the Week 8 exercises using real Olist e-commerce data!

### Real-World Application:
These optimization techniques are essential for:
- Production data pipelines
- Real-time analytics dashboards
- Large-scale customer analysis
- E-commerce recommendation systems
- Business intelligence reporting

**Remember:** Premature optimization is the root of all evil, but optimization for production scale is essential! Always measure before and after.