# 🧩 Add-on Module 1: Performance Optimization & Memory Efficiency
**Level:** Advanced

---
## 🎯 Learning Objectives
In this module, you will:
- Understand how Pandas' internal data structures impact performance  
- Learn to **profile**, **optimize**, and **accelerate** DataFrame operations  
- Explore **vectorization**, **categorical encoding**, and **in-place updates**  
- Compare performance across optimization strategies  
- Apply these techniques to a **real-world retail dataset (~5M rows)**  

---
## 🧠 Why Optimization Matters
Pandas is powerful, but it is also **memory-bound** and largely **single-threaded**: a DataFrame must fit in RAM, and most operations run on one core.
Without optimization, operations on large DataFrames can become slow or even crash due to RAM exhaustion.

With efficient use of **data types**, **vectorization**, and **in-place updates**, pipelines can often run substantially faster and use far less memory. Gains on the order of **10x** in speed and **10x** in memory footprint are possible, depending on the data and the workload.

## 🔹 1.1 Measuring Memory Usage

In [None]:
import pandas as pd
import numpy as np

# Create a large synthetic dataset
N = 1_000_000
df = pd.DataFrame({
    'user_id': np.random.randint(1, 100_000, size=N),
    'age': np.random.randint(18, 70, size=N),
    'city': np.random.choice(['New York', 'Paris', 'Berlin', 'Tokyo', 'Delhi'], size=N),
    'spend': np.random.uniform(10.0, 1000.0, size=N)
})

df.info(memory_usage='deep')

💡 `memory_usage='deep'` inspects `object` columns so the estimate includes the Python objects they reference (such as strings), not just the pointers to them.
Next, we'll optimize these columns for better memory efficiency.
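Before changing any dtypes, it can help to see how far the shallow and deep estimates diverge. The sketch below uses a small, hypothetical frame (`demo` is not the module's dataset) and forces the `city` column to `object` dtype so the comparison is explicit.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, not the module's dataset)
rng = np.random.default_rng(0)
cities = rng.choice(["New York", "Paris", "Berlin"], size=10_000)
demo = pd.DataFrame({
    "age": rng.integers(18, 70, size=10_000),
    "city": pd.Series(cities, dtype=object),  # force object dtype for the comparison
})

# Shallow: counts only the array buffers (pointers for object columns)
shallow = demo.memory_usage(deep=False)
# Deep: also counts the Python string objects the pointers refer to
deep = demo.memory_usage(deep=True)

report = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(report)
```

The numeric column should report the same size either way, while the string column should be noticeably larger under the deep estimate.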

## 🔹 1.2 Optimizing Data Types

In [None]:
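# Record memory before optimizing so the two figures can be compared
original_memory = df.memory_usage(deep=True).sum() / 1024**2

# Convert 'city' to category (few distinct values repeated across many rows)
df['city'] = df['city'].astype('category')

# Downcast numeric columns to the smallest dtype that holds the values
df['user_id'] = pd.to_numeric(df['user_id'], downcast='unsigned')
df['age'] = pd.to_numeric(df['age'], downcast='unsigned')
df['spend'] = pd.to_numeric(df['spend'], downcast='float')

# Compare memory usage
optimized_memory = df.memory_usage(deep=True).sum() / 1024**2
print(f'Original Memory Usage: {original_memory:.2f} MB')
print(f'Optimized Memory Usage: {optimized_memory:.2f} MB')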

### ✅ Best Practices
- Convert `object` columns to `category` when a column has few distinct values relative to its length.
- Use `downcast` to choose the smallest numeric dtype that fits the data.
- Avoid `float64` unless high precision is essential; `float32` keeps roughly 7 significant decimal digits.
- Store timestamps in `datetime64[ns]` rather than as strings for efficient arithmetic and filtering.
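Two of these practices are easy to check on a toy example. The sketch below uses a hypothetical `orders` frame with made-up values: it parses string timestamps into a datetime column, then measures the rounding error introduced by casting to `float32`. The value `16_777_217` was chosen because it is just above the largest integer range that `float32` represents exactly.

```python
import pandas as pd

# Hypothetical sample data for illustration
orders = pd.DataFrame({
    "ordered_at": ["2024-01-05 10:00", "2024-02-17 14:30", "2024-03-02 09:15"],
    "amount": [16_777_217.0, 19.99, 250.5],
})

# Parse strings into a datetime column so comparisons are vectorized
orders["ordered_at"] = pd.to_datetime(orders["ordered_at"])
february_onward = orders[orders["ordered_at"] >= "2024-02-01"]

# Cast to float32 and measure how far the values moved
as_float32 = orders["amount"].astype("float32")
max_abs_error = (as_float32.astype("float64") - orders["amount"]).abs().max()

print(february_onward)
print(f"Largest rounding error after float32 cast: {max_abs_error}")
```

A check like this is a reasonable safeguard before downcasting monetary or other precision-sensitive columns.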

## 🔹 1.3 Vectorization vs. Loops

Pandas is built on top of **NumPy**, so vectorized operations are much faster than Python loops.

In [None]:
import time

def loop_method(df):
    result = []
    for s in df['spend']:
        result.append(s * 1.05)
    df['spend_taxed_loop'] = result

def vectorized_method(df):
    df['spend_taxed_vec'] = df['spend'] * 1.05

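# Benchmark: copy outside the timed region so only the computation is measured
loop_df = df.copy()
start = time.perf_counter()
loop_method(loop_df)
print(f'Loop Time: {time.perf_counter() - start:.4f}s')

vec_df = df.copy()
start = time.perf_counter()
vectorized_method(vec_df)
print(f'Vectorized Time: {time.perf_counter() - start:.4f}s')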

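🧩 **Result:** Vectorized operations are often **50–200x faster** than loops, since the arithmetic runs in NumPy's compiled C code rather than in the Python interpreter. The exact speedup depends on data size and hardware.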
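The same idea extends to conditional logic, which is where row-by-row code tends to creep in. As a minimal sketch, assuming a hypothetical rule that gives a 10% discount on spend above 500, `np.where` evaluates the condition for the whole column in one call:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
spend = pd.Series(rng.uniform(10.0, 1000.0, size=100_000), name="spend")

# Row-by-row version: calls a Python function once per element
def discount(value):
    # Hypothetical rule: 10% off when spend exceeds 500
    return value * 0.9 if value > 500 else value

looped = spend.apply(discount)

# Vectorized version: one NumPy call handles the whole column
vectorized = pd.Series(
    np.where(spend > 500, spend * 0.9, spend),
    index=spend.index,
    name="spend",
)

print(np.allclose(looped, vectorized))
```

Both versions produce the same values; only the place where the loop runs (Python versus compiled code) differs.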

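## 🔹 1.4 Using `eval()` and `query()` for Faster Computations

Pandas provides `eval()` and `query()` for evaluating string expressions. When the optional `numexpr` package is installed, large expressions are evaluated in compiled code without materializing every intermediate array, which can improve performance and reduce memory overhead. For small DataFrames or simple expressions, plain operators are often just as fast, so benchmark both.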

In [None]:
sales = pd.DataFrame({
    'price': np.random.uniform(5, 500, size=1_000_000),
    'quantity': np.random.randint(1, 10, size=1_000_000)
})

# Regular computation
%timeit sales['total'] = sales['price'] * sales['quantity']

# Using eval()
%timeit sales.eval('total = price * quantity', inplace=True)

⚙️ `eval()` and `query()` are best used for:
- Large DataFrames with repetitive arithmetic
- Expressions with multiple columns
- Cases where temporary intermediate arrays are expensive to create
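The cell above covers `eval()`. For `query()`, a minimal sketch of the equivalent filtering pattern might look like the following, where `sales_demo` and the `min_price` threshold are illustrative names rather than part of the module's dataset. Local Python variables are referenced inside the expression with an `@` prefix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
sales_demo = pd.DataFrame({
    "price": rng.uniform(5, 500, size=100_000),
    "quantity": rng.integers(1, 10, size=100_000),
})

min_price = 100  # hypothetical threshold, referenced in the expression with @

# Boolean-mask filtering builds an intermediate Series for each comparison
masked = sales_demo[(sales_demo["price"] > min_price) & (sales_demo["quantity"] >= 5)]

# query() expresses the same filter as a single string expression
queried = sales_demo.query("price > @min_price and quantity >= 5")

print(masked.equals(queried))
```

Whether `query()` is actually faster here depends on the frame size and on whether `numexpr` is available, so it is worth timing both forms on your own data.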

## 🔹 1.5 Real-world Case Study: Retail Transactions

In [None]:
rows = 5_000_000
retail = pd.DataFrame({
    'transaction_id': np.arange(rows),
    'customer_id': np.random.randint(1, 500_000, rows),
    'country': np.random.choice(['US', 'UK', 'DE', 'IN', 'AU'], rows),
    'amount': np.random.uniform(10, 1000, rows),
    'tax_rate': np.random.uniform(0.05, 0.18, rows)
})

print(f'Memory Before: {retail.memory_usage(deep=True).sum() / 1024**2:.2f} MB')

# Optimize dtypes
retail['country'] = retail['country'].astype('category')
retail['customer_id'] = pd.to_numeric(retail['customer_id'], downcast='unsigned')
retail['amount'] = pd.to_numeric(retail['amount'], downcast='float')
retail['tax_rate'] = pd.to_numeric(retail['tax_rate'], downcast='float')

print(f'Memory After: {retail.memory_usage(deep=True).sum() / 1024**2:.2f} MB')

✅ For a dataset like this one, memory usage can often be reduced by roughly **3–4x** through categorical encoding and numeric downcasting alone.
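When many frames need the same treatment, the per-column steps can be wrapped in a helper. The function below is a hypothetical sketch, not a pandas API: the name `shrink_dtypes` and the `max_unique_ratio` cutoff are assumptions chosen for illustration, and the cutoff should be tuned to your data.

```python
import numpy as np
import pandas as pd

def shrink_dtypes(frame, max_unique_ratio=0.5):
    """Return a copy with downcast numerics and low-cardinality strings as category."""
    out = frame.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast="float")
        elif pd.api.types.is_string_dtype(series):
            # Only encode when values repeat often enough to pay off
            if series.nunique() / max(len(series), 1) < max_unique_ratio:
                out[col] = series.astype("category")
    return out

# Hypothetical sample resembling the retail data, at a smaller scale
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "customer_id": rng.integers(1, 500_000, size=50_000),
    "country": rng.choice(["US", "UK", "DE", "IN", "AU"], size=50_000),
    "amount": rng.uniform(10, 1000, size=50_000),
})

small = shrink_dtypes(demo)
before_mb = demo.memory_usage(deep=True).sum() / 1024**2
after_mb = small.memory_usage(deep=True).sum() / 1024**2
print(f"{before_mb:.2f} MB -> {after_mb:.2f} MB")
print(small.dtypes)
```

The helper returns a copy rather than modifying its input, which keeps the original frame available for comparison at the cost of temporarily holding both in memory.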

## 🔹 1.6 Profiling and Benchmarking

In [None]:
from time import perf_counter

start = perf_counter()
retail.eval('total = amount + (amount * tax_rate)', inplace=True)
end = perf_counter()

print(f'Execution Time: {end - start:.3f} seconds')
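A single timed run can be noisy. One common approach, sketched below with the standard library's `timeit.repeat`, is to run each candidate several times and keep the fastest result. The `frame` here is a smaller hypothetical stand-in for `retail`, and the two function names are illustrative.

```python
from timeit import repeat

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
frame = pd.DataFrame({
    "amount": rng.uniform(10, 1000, size=200_000),
    "tax_rate": rng.uniform(0.05, 0.18, size=200_000),
})

def with_eval():
    return frame.eval("amount + (amount * tax_rate)")

def with_operators():
    return frame["amount"] + frame["amount"] * frame["tax_rate"]

# Run each candidate 5 times per trial, 3 trials, and keep the fastest trial
calls = 5
eval_best = min(repeat(with_eval, number=calls, repeat=3)) / calls
ops_best = min(repeat(with_operators, number=calls, repeat=3)) / calls

print(f"eval():    {eval_best * 1e3:.2f} ms per call")
print(f"operators: {ops_best * 1e3:.2f} ms per call")
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the measurement.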

---
## 🧩 Challenge: Optimize a Customer Dataset

You are given a CSV file with the following columns:
`customer_id`, `gender`, `region`, `income`, `purchases`

Tasks:
1. Load and inspect memory usage.  
2. Convert columns to more efficient data types (`region → category`, `income → float32`).  
3. Compare runtime of computing average income using:
   - A loop  
   - Vectorized `groupby()`  
4. Report memory and performance improvements.

💡 Hint: Use `df.memory_usage(deep=True)` and `%timeit` for benchmarking.
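As a possible starting point for tasks 1 and 2, dtypes can also be declared at load time through the `dtype` parameter of `pd.read_csv`, which avoids creating the larger default columns first. The inline CSV text below is a made-up stand-in for the challenge file, and the `int16` choice for `purchases` is an assumption about its value range.

```python
import io

import pandas as pd

# Hypothetical stand-in for the challenge CSV
csv_text = """customer_id,gender,region,income,purchases
1,F,North,52000.5,12
2,M,South,48000.0,7
3,F,North,61000.25,3
"""

customers = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "gender": "category",
        "region": "category",
        "income": "float32",
        "purchases": "int16",  # assumes purchase counts fit in 16 bits
    },
)

print(customers.dtypes)
print(customers.memory_usage(deep=True))
```

On a frame this small, categories will not save memory; the benefit appears once values repeat across many rows.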

## 📘 Summary
- ✅ Use **categories** and **downcasting** for memory efficiency.  
- ✅ Avoid Python loops; prefer **vectorized** operations and, for large expressions, **eval()**.  
- ✅ Profile before optimizing: use `%timeit`, `perf_counter()`, and `df.info()`.  
- ✅ Optimization = more rows processed, less RAM used, faster pipelines.

---
### 🚀 Next Module → Parallelization & Scaling with Dask
Learn how to scale Pandas computations across multiple CPU cores or even clusters.