# Optional Add-on Module 4: Profiling, Benchmarking & Scaling Pandas Pipelines

In this final advanced section, we focus on **diagnosing and improving Pandas performance**.
You'll learn to:
- Profile CPU, memory, and I/O usage.
- Benchmark operations across alternative implementations.
- Scale Pandas pipelines using chunking, Dask, and parallel tools.
- Apply caching, lazy loading, and schema optimization for production pipelines.

## 1. Profiling Pandas Performance

Before optimizing, identify the **bottlenecks**. You can measure Pandas code at several levels, from simple wall-clock timing to third-party profilers.

In [ ]:
import pandas as pd
import numpy as np
import time

# Generate synthetic dataset
df = pd.DataFrame({
    'A': np.random.randint(1, 100, 1_000_000),
    'B': np.random.randn(1_000_000)
})

# Example operation
start = time.perf_counter()  # monotonic, high-resolution clock suited to short intervals
df['C'] = df['A'] * np.log1p(df['B'].abs())
print(f"Execution Time: {time.perf_counter() - start:.3f}s")

### Using `%timeit` and `memory_usage()`
For quick benchmarking in Jupyter, use `%timeit`. For deeper insights, track memory consumption.

In [ ]:
%timeit df['A'] * np.log1p(df['B'].abs())
print('Memory usage (MB):', df.memory_usage(deep=True).sum() / (1024**2))
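
Once `memory_usage(deep=True)` shows where the bytes go, dtype choices are often the cheapest fix. The sketch below is illustrative only: the `label` column and the `total_mb` helper are hypothetical, the row count is reduced so it runs quickly, and the `float32` cast trades precision for space, which may not be acceptable for every column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # smaller than the 1M rows above so the sketch runs quickly
demo = pd.DataFrame({
    "A": rng.integers(1, 100, n),             # values 1..99 fit in int8
    "B": rng.standard_normal(n),
    "label": rng.choice(["x", "y", "z"], n),  # hypothetical low-cardinality text column
})

def total_mb(frame):
    """Total memory of a DataFrame in MB, including string payloads."""
    return frame.memory_usage(deep=True).sum() / 1024**2

slim = demo.assign(
    A=pd.to_numeric(demo["A"], downcast="integer"),  # smallest integer type that holds the data
    B=demo["B"].astype("float32"),                   # halves the size, loses precision
    label=demo["label"].astype("category"),          # integer codes plus a small lookup table
)

before_mb, after_mb = total_mb(demo), total_mb(slim)
print(f"before: {before_mb:.2f} MB, after: {after_mb:.2f} MB")
```

Downcasting is only safe when the value range is known in advance; a later chunk with larger values would need a wider type.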

## 2. Advanced Profiling with `line_profiler` and `memory_profiler`

For detailed line-by-line CPU and memory analysis, use specialized profilers.

In [ ]:
!pip install -q line_profiler memory_profiler
from memory_profiler import memory_usage
import numpy as np

def compute(df):
    df['scaled'] = (df['A'] ** 2 + np.sqrt(df['B'].abs())) / np.log1p(df['A'])
    return df

mem_before = memory_usage()[0]
df = compute(df)
mem_after = memory_usage()[0]
print(f"Memory delta: {mem_after - mem_before:.2f} MB")

### Under the Hood
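- **line_profiler** measures per-line execution time, but only for functions you explicitly register (for example with `%lprun -f compute compute(df)` in Jupyter).
- **memory_profiler** samples the memory use of the whole Python process (via `psutil`); it does not hook into the garbage collector or report per-object sizes.
- Many NumPy-backed numeric operations release the GIL, but operations on Python objects (`dtype=object`) do not.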
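
A before/after delta can miss temporaries that are allocated and freed inside a single expression. As a rough stdlib-only alternative (this is not part of `memory_profiler`), `tracemalloc` reports both retained and peak traced memory. The sketch assumes NumPy array buffers are visible to `tracemalloc`, as they are in recent NumPy releases; `compute_scaled` and `sample` are names introduced here for illustration.

```python
import tracemalloc

import numpy as np
import pandas as pd

def compute_scaled(frame):
    # Same expression as compute() above, returned instead of assigned to a column
    return (frame["A"] ** 2 + np.sqrt(frame["B"].abs())) / np.log1p(frame["A"])

rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "A": rng.integers(1, 100, 200_000),
    "B": rng.standard_normal(200_000),
})

tracemalloc.start()
result = compute_scaled(sample)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()  # (retained now, highest seen)
tracemalloc.stop()

print(f"retained: {current_bytes / 1024**2:.2f} MB, peak: {peak_bytes / 1024**2:.2f} MB")
```

A large gap between the two numbers suggests that intermediate arrays, rather than the final result, dominate memory during the computation.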

## 3. Benchmarking Strategies

Systematically compare different implementations of the same task, for example loops vs. vectorization vs. Dask.

In [ ]:
import dask.dataframe as dd

# Loop-based computation
def loop_sum(df):
    return [x + y for x, y in zip(df['A'], df['B'])]

# Vectorized computation
def vector_sum(df):
    return df['A'] + df['B']

# Dask-based computation
def dask_sum(df):
    ddf = dd.from_pandas(df, npartitions=8)
    # .compute() runs the lazy task graph; without it only graph construction is timed
    return (ddf['A'] + ddf['B']).compute()

# Timing comparison (single run each; repeat the runs for more stable numbers)
import time
for func in [loop_sum, vector_sum, dask_sum]:
    start = time.perf_counter()
    _ = func(df)
    print(f"{func.__name__:<12} : {time.perf_counter() - start:.3f}s")

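✅ **Observation:** Vectorized code typically outperforms explicit Python loops by a wide margin. Dask only does its work when `.compute()` is called, and on data that already fits in memory its scheduling overhead can make it slower than plain vectorized Pandas.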

**Tip:** Always benchmark using realistic data volumes and compute environments.
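
A single timed run is noisy. One possible harness, sketched here with the standard `timeit` module, repeats each candidate, keeps the best time, and checks that the implementations agree before comparing them. The `benchmark` helper, the `bench_df` frame, and the column names in the report are assumptions made for this example, and the data is kept small so it runs quickly.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bench_df = pd.DataFrame({
    "A": rng.integers(1, 100, 50_000),
    "B": rng.standard_normal(50_000),
})

def loop_sum(frame):
    return [x + y for x, y in zip(frame["A"], frame["B"])]

def vector_sum(frame):
    return frame["A"] + frame["B"]

def benchmark(funcs, frame, repeat=3, number=1):
    """Time each function `repeat` times and report the best per-call time."""
    rows = []
    for func in funcs:
        timings = timeit.repeat(lambda: func(frame), repeat=repeat, number=number)
        rows.append({"implementation": func.__name__, "best_s": min(timings) / number})
    return pd.DataFrame(rows).sort_values("best_s", ignore_index=True)

# A benchmark is only meaningful if the candidates compute the same thing
assert np.allclose(loop_sum(bench_df), vector_sum(bench_df))

report = benchmark([loop_sum, vector_sum], bench_df)
print(report)
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the result.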

## 4. Scaling Patterns

Scaling Pandas involves either **vertical scaling (optimization)** or **horizontal scaling (distribution)**.

### 4.1 Chunk Processing (Streaming Large Files)
For files larger than memory, load data in chunks.

In [ ]:
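chunk_iter = pd.read_csv('large_data.csv', chunksize=500_000)
totals = None

for chunk in chunk_iter:
    # Track per-category sum and count; averaging per-chunk means would give wrong results
    part = chunk.groupby('category')['value'].agg(['sum', 'count'])
    totals = part if totals is None else totals.add(part, fill_value=0)

final_df = (totals['sum'] / totals['count']).rename('mean_value')
print(final_df.head())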
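
Combining per-chunk results needs care for means: averaging the per-chunk averages gives every chunk the same weight regardless of how many rows it contributed. The following self-contained check uses a tiny in-memory CSV with made-up values (not the `large_data.csv` file) to show that accumulating sums and counts reproduces the full-data mean.

```python
import io

import pandas as pd

# Made-up data: categories are spread unevenly across chunks of two rows
csv_text = "category,value\na,1\na,2\nb,10\na,6\nb,20\nb,30\n"

totals = None
naive_parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grouped = chunk.groupby("category")["value"]
    naive_parts.append(grouped.mean())
    part = grouped.agg(["sum", "count"])
    totals = part if totals is None else totals.add(part, fill_value=0)

chunked_mean = totals["sum"] / totals["count"]
naive_mean = pd.concat(naive_parts).groupby(level=0).mean()  # mean of per-chunk means
full_mean = pd.read_csv(io.StringIO(csv_text)).groupby("category")["value"].mean()

print(chunked_mean.to_dict())  # {'a': 3.0, 'b': 20.0}, identical to full_mean
print(naive_mean.to_dict())    # {'a': 3.75, 'b': 17.5}, which is wrong
```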

### 4.2 Parallel Execution with Swifter / Modin
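Swifter tries to vectorize the function first and otherwise distributes `apply()` across cores for compute-heavy transforms. Modin (not shown here) takes a different route and parallelizes the DataFrame API itself.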

In [ ]:
!pip install -q swifter
import swifter

def complex_transform(x):
    return np.sin(x) * np.log1p(abs(x))

df['transformed'] = df['B'].swifter.apply(complex_transform)
df.head()
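
To make the partition-and-combine idea concrete without extra dependencies, here is a minimal sketch using a thread pool from the standard library. It is not how Swifter works internally. The `parallel_apply` helper is hypothetical, and threads only help when the underlying NumPy calls release the GIL; for pure-Python functions a process pool would be the usual choice. Whether it beats a single vectorized call depends on data size and hardware.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def complex_transform(values):
    # Vectorized form of the transform above: operates on a whole Series at once
    return np.sin(values) * np.log1p(values.abs())

def parallel_apply(series, func, n_parts=4):
    """Split a Series into contiguous slices, apply func to each in a thread, and reassemble."""
    bounds = np.linspace(0, len(series), n_parts + 1, dtype=int)
    parts = [series.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(func, parts))  # map preserves the order of the slices
    return pd.concat(results)

rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(100_000), name="B")

parallel = parallel_apply(values, complex_transform)
serial = complex_transform(values)
print(np.allclose(parallel, serial))
```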

### 4.3 Dask for Cluster-scale Scaling
Dask extends Pandas syntax to distributed computing with minimal changes.

In [ ]:
ddf = dd.from_pandas(df, npartitions=8)
result = ddf.groupby('A')['B'].mean().compute()
print(result.head())

## 5. Real-World Problem Examples

### Problem 1: Benchmarking ETL Pipeline
You are tasked with migrating a legacy data-cleaning script. Measure the performance difference between:
1. Original loop-based version.
2. Vectorized Pandas version.
3. Dask parallelized version.

Collect time and memory profiles using `timeit` and `memory_profiler`.

### Problem 2: Scaling a Daily Aggregation Job
An e-commerce dataset (10GB) must produce daily revenue summaries.
Use **chunking + Dask** to scale the process incrementally.

```python
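chunks = pd.read_csv('orders.csv', chunksize=1_000_000)
daily = None
for c in chunks:
    part = c.groupby('date')['revenue'].sum()
    # A date can span several chunks, so merge partial sums instead of appending rows
    daily = part if daily is None else daily.add(part, fill_value=0)
daily.to_csv('daily_summaries.csv')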
```

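Chunking keeps memory use bounded by the chunk size rather than the 10GB file size. It does not by itself speed up the job, which is where Dask's parallelism can help.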

## Best Practices / Pitfalls

✅ **Best Practices:**
- Always **profile first**, optimize later.
- Use **vectorization** as your default pattern.
- Apply **Dask or Swifter** when memory or CPU is the limiting factor.
- Cache intermediate results if reused across stages.
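
As one way to cache an intermediate stage, the sketch below writes the result to disk on first use and reloads it afterwards. The `cached_frame` helper, the `expensive_stage` function, and the file name are all hypothetical. Pickle is used only because it needs no extra dependency; it should be loaded only from trusted locations, and the cache must be deleted manually when upstream inputs change.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

def cached_frame(path, build):
    """Return the DataFrame stored at `path`, building and saving it first if missing."""
    path = Path(path)
    if path.exists():
        return pd.read_pickle(path)
    frame = build()
    frame.to_pickle(path)
    return frame

calls = []  # records how often the expensive stage really runs

def expensive_stage():
    calls.append(1)
    rng = np.random.default_rng(0)
    return pd.DataFrame({"A": rng.integers(1, 100, 1_000)})

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "stage1.pkl"
    first = cached_frame(cache_path, expensive_stage)   # builds and writes the cache
    second = cached_frame(cache_path, expensive_stage)  # reads the cache, no rebuild

print(len(calls))  # 1
```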

⚠️ **Pitfalls:**
- Excessive chunk sizes can overload memory.
- Dask computations are lazy: forgetting `.compute()` leaves the task graph unexecuted.
- Profilers add overhead; use them for diagnosis, not production.

## Challenge Exercise

**Task:** Build a profiling dashboard that:
1. Loads 2M synthetic records.
2. Runs 3 operations (merge, apply, groupby).
3. Profiles time and memory for each version (Pandas vs Dask).
4. Summarizes the results as a performance comparison DataFrame.

_Hint_: Use `%timeit`, `memory_usage()`, and Dask partitions.
