# 🧩 Add-on Module 8: Profiling, Optimization & Best Practices for Production Pandas

In this module, we explore how to move from analysis to **production-quality Pandas pipelines**:

- Profiling slow and memory-heavy code
- Understanding the internal execution model
- Applying vectorization and parallelization
- Using efficient data formats
- Establishing reproducible, scalable patterns

We'll apply these lessons to a **real-world ETL (Extract, Transform, Load)** scenario with millions of rows.

## 1️⃣ Profiling Pandas Performance

Before optimizing, you must identify the bottlenecks.

### Tools for Profiling:
- `%timeit` and `%prun` in IPython/Jupyter
- `cProfile` for call-level profiling
- `memory_profiler` for RAM usage
- `line_profiler` for per-line CPU cost

In [ ]:
import pandas as pd
import numpy as np

# Generate synthetic data
n = 2_000_000
df = pd.DataFrame({
    'user_id': np.random.randint(1, 50000, n),
    'amount': np.random.uniform(10, 1000, n),
    'category': np.random.choice(['electronics', 'grocery', 'fashion', 'home'], n)
})

# Example: Profiling aggregation time
%timeit df.groupby('category')['amount'].mean()
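
`%timeit` only exists inside IPython/Jupyter. For a plain script, a call-level profile with the standard library's `cProfile` is one option. The sketch below is a minimal example under stated assumptions: the `summarize` function, the smaller row count, and the fixed random seed are illustrative choices, not part of the module's dataset.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Small synthetic frame so the sketch runs quickly; sizes are illustrative.
rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "user_id": rng.integers(1, 50_000, n),
    "amount": rng.uniform(10, 1000, n),
    "category": rng.choice(["electronics", "grocery", "fashion", "home"], n),
})


def summarize(frame):
    """Hypothetical workload to profile: mean amount per category."""
    return frame.groupby("category")["amount"].mean()


# Record every function call made while summarize() runs.
profiler = cProfile.Profile()
profiler.enable()
result = summarize(df)
profiler.disable()

# Write the ten most expensive calls (by cumulative time) into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the high-level call that dominates the run, which is usually the right place to start before drilling into per-line tools such as `line_profiler`.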

### 🔍 Memory Profiling Example

In [ ]:
from memory_profiler import memory_usage

def summarize():
    return df.groupby('category')['amount'].mean()

# memory_usage samples the process memory (in MiB) while summarize() runs
mem_usage = memory_usage((summarize, (), {}))
print(f"Peak increase: {max(mem_usage) - min(mem_usage):.2f} MiB")

## 2️⃣ Vectorization vs Loops

The biggest performance killer in Pandas is **Python-level loops**.

Use vectorized operations (powered by NumPy's C-level speed) instead.

In [ ]:
# ❌ Bad: Python loop
def loop_sum(df):
    total = []
    for amt in df['amount']:
        total.append(amt * 1.18)
    df['taxed'] = total
    return df

# ✅ Good: Vectorized operation
df['taxed'] = df['amount'] * 1.18
df.head()
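
The same idea extends to conditional logic, which is where row-wise `apply()` calls often creep in. The sketch below compares a row-wise function with a `np.where` version. The per-category rates are hypothetical values chosen for illustration; only the 1.18 factor comes from the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [100.0, 250.0, 40.0, 80.0],
    "category": ["grocery", "electronics", "grocery", "fashion"],
})

# Hypothetical rates for illustration only.
GROCERY_RATE = 1.05
DEFAULT_RATE = 1.18


def tax_row(row):
    """Row-wise version: one Python function call per row."""
    rate = GROCERY_RATE if row["category"] == "grocery" else DEFAULT_RATE
    return row["amount"] * rate


slow = df.apply(tax_row, axis=1)

# Vectorized version: one boolean mask, one element-wise multiplication.
rate = np.where(df["category"] == "grocery", GROCERY_RATE, DEFAULT_RATE)
df["taxed"] = df["amount"] * rate
```

Both versions produce the same values; the vectorized one avoids building a Python object per row, which is where most of the time goes on large frames.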

## 3️⃣ Using Efficient Data Types

Downcasting numeric columns and converting low-cardinality strings to `category` can drastically shrink memory usage.

In [ ]:
df.info(memory_usage='deep')

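# Convert to narrower dtypes where the value range allows it
df['user_id'] = df['user_id'].astype('int32')        # IDs below 50,000 fit easily in int32
df['amount'] = df['amount'].astype('float32')        # ~7 significant digits; avoid for exact currency math
df['category'] = df['category'].astype('category')   # 4 distinct values stored as integer codes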

df.info(memory_usage='deep')
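
When a frame has many columns, picking dtypes by hand gets tedious. The sketch below shows one way to automate the conversion with `pd.to_numeric(..., downcast=...)`. The helper name `shrink` and the `max_category_ratio` threshold are hypothetical; tune or replace them for your own data.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def shrink(frame, max_category_ratio=0.5):
    """Return a copy with smaller dtypes where the values allow it.

    `shrink` and `max_category_ratio` are hypothetical names for this sketch.
    """
    out = frame.copy()
    for col in out.columns:
        series = out[col]
        if ptypes.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif ptypes.is_float_dtype(series):
            # May reduce precision to float32; skip for exact currency math.
            out[col] = pd.to_numeric(series, downcast="float")
        elif ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series):
            # Categories only pay off when values repeat often.
            if series.nunique() / max(len(series), 1) <= max_category_ratio:
                out[col] = series.astype("category")
    return out


rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "user_id": rng.integers(1, 50_000, n),
    "amount": rng.uniform(10, 1000, n),
    "category": rng.choice(["electronics", "grocery", "fashion", "home"], n),
})

before = df.memory_usage(deep=True).sum()
small = shrink(df)
after = small.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```

The helper returns a copy, so the original frame stays available for comparison. In a tight-memory pipeline you would more likely pass explicit `dtype=` arguments at load time.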

## 4️⃣ I/O Optimization: Parquet & Feather

When dealing with large files, the storage format matters.

**CSV** is human-readable but slow. Prefer **binary formats** like:

- `.parquet` (columnar, compressed, great for analytics)
- `.feather` (lightweight and fast)

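Both formats let you read only the columns you need. Parquet readers can also apply **predicate pushdown**, skipping row groups whose statistics rule out a filter, which makes Parquet a good fit for incremental ETL.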

In [ ]:
# Save as Parquet
df.to_parquet('optimized_sales.parquet', index=False)

# Reloading is typically much faster than parsing the equivalent CSV
df2 = pd.read_parquet('optimized_sales.parquet')
df2.head()

## 5️⃣ Caching and Chunk Processing

When data is too big to fit into RAM, process it in chunks.

You can combine chunking with `HDF5`, `SQLite`, or Dask for streaming-style processing.

In [ ]:
chunk_iter = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv', chunksize=20)

# Accumulate sum and count; averaging per-chunk means would over-weight the short final chunk
total, count = 0.0, 0
for chunk in chunk_iter:
    total += chunk['tip'].sum()
    count += chunk['tip'].count()

print(f"Average Tip (chunked processing): {total / count:.2f}")
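
Grouped statistics can be chunked the same way, as long as each chunk contributes values that combine exactly (sums and counts) and not values that do not (means). The sketch below uses a small inline CSV as a stand-in for a large file. The data and the chunk size are made up for illustration.

```python
import io

import pandas as pd

# Inline CSV standing in for a large file; contents are made up for the sketch.
csv_text = """day,tip
Thur,2.0
Fri,3.0
Thur,4.0
Sat,1.0
Fri,5.0
Sat,3.0
Thur,6.0
"""

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Per-chunk sums and counts can be added together later; means cannot.
    partials.append(chunk.groupby("day")["tip"].agg(["sum", "count"]))

# Merge the partial results by group label, then derive the mean once.
combined = pd.concat(partials).groupby(level=0).sum()
chunked_mean = combined["sum"] / combined["count"]

# Reference: the same statistic computed in a single pass.
full_mean = pd.read_csv(io.StringIO(csv_text)).groupby("day")["tip"].mean()
print(chunked_mean)
```

Only the small per-chunk summaries are kept in memory, so peak usage depends on the chunk size and the number of groups, not on the file size.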

## 6️⃣ Real-World Problem 1: Financial Transaction Pipeline

**Scenario:**
A fintech company receives millions of transactions daily. You need to:

- Detect abnormal spending patterns
- Cut ETL time from 4 hours to under 30 minutes
- Reduce dataset memory by 80%

**Approach:**
1. Use Parquet instead of CSV
2. Convert strings to categories
3. Apply vectorized operations for fraud scoring
4. Use Dask for lazy parallel loading
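
Step 3 of the approach above can be sketched as a per-user z-score computed with `groupby(...).transform`, so no Python loop touches individual rows. The scoring rule, the `THRESHOLD` value, and the sample transactions are all hypothetical; a real fraud model would use richer features.

```python
import numpy as np
import pandas as pd

# Made-up transactions: user 1 has one amount far outside their usual range.
tx = pd.DataFrame({
    "user_id": [1] * 10 + [2] * 10,
    "amount": [
        20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 18.0, 21.0, 20.0, 500.0,
        300.0, 310.0, 295.0, 305.0, 290.0, 300.0, 302.0, 298.0, 307.0, 293.0,
    ],
})
tx["user_id"] = tx["user_id"].astype("int32")

# transform() returns one value per row, aligned with the original index.
grouped = tx.groupby("user_id")["amount"]
user_mean = grouped.transform("mean")
user_std = grouped.transform("std").replace(0, np.nan)  # avoid division by zero

# z-score: distance from the user's own mean, in standard deviations.
tx["z_score"] = (tx["amount"] - user_mean) / user_std

THRESHOLD = 2.5  # hypothetical cutoff for this sketch
tx["suspicious"] = tx["z_score"].abs() > THRESHOLD

print(tx[tx["suspicious"]])
```

Because every step is a column-level operation, the same code scales to millions of rows, and it maps closely onto Dask DataFrames if the data outgrows a single machine's memory.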

## 7️⃣ Real-World Problem 2: Marketing Data Cleanup

**Scenario:**
Marketing data includes millions of customer events (clicks, purchases, ad views).

**Goal:** Deduplicate by (user_id, event_time), compress, and compute daily metrics efficiently.

**Approach:**
- Use `df.drop_duplicates(['user_id', 'event_time'], keep='last')`
- Convert timestamps to `datetime64[ns]`
- Cache daily summaries to Parquet
- Use `memory_profiler` to confirm that peak memory stays within limits
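
The first three steps can be sketched on a toy event log. The column `event_type`, the metric names, and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Made-up event log; the second row repeats the (user_id, event_time) of the first.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "event_time": [
        "2024-01-01 09:00:00",
        "2024-01-01 09:00:00",
        "2024-01-01 10:30:00",
        "2024-01-02 08:15:00",
        "2024-01-02 11:45:00",
    ],
    "event_type": ["click", "purchase", "click", "ad_view", "click"],
})

# Parse timestamps and compress the repetitive string column.
events["event_time"] = pd.to_datetime(events["event_time"])
events["event_type"] = events["event_type"].astype("category")

# Keep the last record for each (user_id, event_time) pair.
deduped = events.drop_duplicates(["user_id", "event_time"], keep="last")

# Daily metrics: event count and distinct users per calendar day.
daily = (
    deduped
    .groupby(deduped["event_time"].dt.normalize())
    .agg(events=("user_id", "size"), unique_users=("user_id", "nunique"))
    .rename_axis("day")
)
print(daily)
```

In a pipeline, `daily` is the small summary you would write to Parquet, so later runs can read the cached result without rescanning the raw events.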

## 🧠 Under the Hood

- **GroupBy & Aggregations** use hash tables internally.
- **Categorical columns** store integer codes + dictionary mapping.
- **Vectorized math** uses NumPy's C/Fortran-level loops.
- **I/O acceleration** relies on Apache Arrow & PyArrow libraries.
- **Lazy evaluation (in Dask/Polars)** builds computation graphs before executing.
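
The "integer codes + dictionary" layout of categoricals can be inspected directly. The short sketch below also shows `pd.factorize`, a public function that performs a similar label-to-integer encoding. Treat the link to GroupBy internals as a simplification, not a precise description of the implementation.

```python
import pandas as pd

s = pd.Series(["grocery", "home", "grocery", "fashion", "home"])
cat = s.astype("category")

# The dictionary: unique labels, sorted for an unordered categorical.
categories = cat.cat.categories.tolist()   # ['fashion', 'grocery', 'home']

# The codes: one small integer per row, pointing into the dictionary.
codes = cat.cat.codes.tolist()             # [1, 2, 1, 0, 2]

# factorize assigns codes in order of first appearance instead.
f_codes, f_uniques = pd.factorize(s)
print(categories, codes, f_codes.tolist(), list(f_uniques))
```

Storing one small integer per row in place of a full Python string is what makes `category` columns so much smaller when labels repeat.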

## ✅ Best Practices Checklist

- [x] Use `df.info(memory_usage='deep')` to audit RAM
- [x] Prefer `.parquet` or `.feather` for I/O
- [x] Convert low-cardinality strings to `category` and downcast numeric types
- [x] Avoid `apply()` loops when possible
- [x] Cache intermediate results on local disk or in Arrow buffers
- [x] Profile before optimizing; don't guess!

## ⚡ Challenge Exercise

You have a 3 GB CSV of user analytics logs.

1. Profile load time and memory usage.
2. Reduce memory by 70% using dtype conversion.
3. Save it as Parquet and compare I/O time.
4. Implement a Dask-based version for incremental processing.
5. Visualize the top 10 users with the most events per day.