# 📘 Section 10: Performance Optimization & Integration with Other Libraries

**Level:** Advanced / Add-on Module

Pandas is highly capable but can become slow, or run out of memory, on very large datasets. This section explores techniques to **profile bottlenecks**, **optimize performance**, and **integrate Pandas with other libraries** such as NumPy, Dask, Polars, and SQLAlchemy.

We'll cover:
- Profiling Pandas performance
- Vectorization & avoiding Python loops
- Memory optimization techniques
- Parallel processing with Dask
- Integration with NumPy, Polars, and databases
- Real-world scalability examples

---

## 🔹 10.1 Profiling Pandas Performance

Use the `%timeit` (single statement) or `%%timeit` (whole cell) Jupyter magics, or the `time` module, to identify slow operations.

In [None]:
import pandas as pd
import numpy as np
import time

# Create a large dataset
N = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 100, size=N),
    'B': np.random.randn(N),
    'C': np.random.choice(['X', 'Y', 'Z'], size=N)
})

# Time inefficient vs efficient operations
# time.perf_counter() is a monotonic, high-resolution clock meant for timing code
start = time.perf_counter()
df['A_squared_loop'] = [x**2 for x in df['A']]
print('Loop time:', round(time.perf_counter() - start, 3), 's')

start = time.perf_counter()
df['A_squared_vec'] = df['A']**2
print('Vectorized time:', round(time.perf_counter() - start, 3), 's')

✅ **Takeaway:** Always prefer **vectorized NumPy-style operations** over Python loops for scalability.
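
A single `time` measurement can be noisy. As a minimal sketch, the standard-library `timeit` module can repeat each measurement and keep the fastest run. The helper name `best_of` and the smaller demo Series are illustrative assumptions, not part of the dataset used above.

```python
import timeit

import numpy as np
import pandas as pd

# Small demo Series (assumed size; large enough to show the gap)
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 100, size=200_000))


def best_of(func, repeat=5):
    """Run `func` once per trial and return the fastest time in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


loop_s = best_of(lambda: [x ** 2 for x in s])  # Python-level loop
vec_s = best_of(lambda: s ** 2)                # vectorized operation

print(f"loop: {loop_s:.4f} s, vectorized: {vec_s:.4f} s")
```

Taking the minimum over several runs reduces the influence of background activity on the machine.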

## 🔹 10.2 Memory Optimization

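Reducing memory footprint is critical when working with large DataFrames. Pandas offers utilities to **downcast numeric types** and **convert object columns to categories**. Downcasting floats to `float32` trades precision for space, and categories pay off mainly for columns with few distinct values.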

In [None]:
df_small = df.copy()

# Downcast integers to the smallest unsigned type that fits (values 0-99 fit in uint8)
df_small['A'] = pd.to_numeric(df_small['A'], downcast='unsigned')

# Downcast float64 to float32 (halves the size, reduces precision)
df_small['B'] = pd.to_numeric(df_small['B'], downcast='float')

# Store the low-cardinality string column as a category
df_small['C'] = df_small['C'].astype('category')

print('Memory usage before:', round(df.memory_usage(deep=True).sum() / 1e6, 2), 'MB')
print('Memory usage after:', round(df_small.memory_usage(deep=True).sum() / 1e6, 2), 'MB')
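
The two techniques above can be wrapped into one helper and checked column by column. This is a sketch under stated assumptions: the function name `shrink`, the 50% uniqueness threshold for choosing `category`, and the demo frame are hypothetical choices, and downcasting floats to `float32` loses precision.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def shrink(frame, max_unique_ratio=0.5):
    """Return a copy with smaller numeric dtypes and category-encoded strings."""
    out = frame.copy()
    for col in out.columns:
        s = out[col]
        if ptypes.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
        elif ptypes.is_float_dtype(s):
            out[col] = pd.to_numeric(s, downcast="float")
        elif (
            ptypes.is_string_dtype(s)
            and len(s) > 0
            and s.nunique() / len(s) < max_unique_ratio
        ):
            # Only worthwhile when few distinct values repeat many times
            out[col] = s.astype("category")
    return out


rng = np.random.default_rng(0)
n = 100_000
demo = pd.DataFrame({
    "A": rng.integers(0, 100, size=n),
    "B": rng.standard_normal(n),
    "C": rng.choice(["X", "Y", "Z"], size=n),
})
small = shrink(demo)

# Per-column comparison in bytes (deep=True also counts string contents)
report = pd.DataFrame({
    "before_bytes": demo.memory_usage(deep=True),
    "after_bytes": small.memory_usage(deep=True),
})
print(small.dtypes)
print(report)
```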

## 🔹 10.3 Parallel and Lazy Computation with Dask

Dask extends Pandas for **out-of-core** (too-large-for-memory) and **parallel** processing. A Dask DataFrame is split into many Pandas partitions and mirrors a large subset of the Pandas API, making it easy to scale up existing Pandas workflows.

In [None]:
import dask.dataframe as dd

# Convert Pandas DataFrame to Dask DataFrame
dask_df = dd.from_pandas(df, npartitions=8)

# Build a lazy groupby; .compute() runs it across partitions and returns a Pandas Series
result = dask_df.groupby('C')['A'].mean().compute()
result

✅ **Takeaway:** Dask is well suited to datasets that exceed memory or to multi-core machines: it splits work across partitions and runs them in parallel.
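
To make the partitioned model concrete, here is a pure-Pandas illustration of how a grouped mean can be computed piece by piece and then combined. It is not Dask's implementation, only a sketch of the split-then-combine idea; the partition count and all variable names are assumptions chosen for the demo.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "A": rng.integers(0, 100, size=100_000),
    "C": rng.choice(["X", "Y", "Z"], size=100_000),
})

# 1. Split the rows into contiguous partitions
n_partitions = 8
bounds = np.linspace(0, len(demo), n_partitions + 1, dtype=int)
partitions = [demo.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

# 2. Per partition, keep only what is needed to rebuild the mean later
partials = [p.groupby("C")["A"].agg(["sum", "count"]) for p in partitions]

# 3. Combine the small partial results, then finish the computation
combined = pd.concat(partials).groupby(level=0).sum()
partitioned_mean = combined["sum"] / combined["count"]

# Reference: the same mean computed in one pass
direct_mean = demo.groupby("C")["A"].mean()
print(partitioned_mean)
```

A mean cannot be averaged across partitions directly, which is why each partition reports a sum and a count instead.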

## 🔹 10.4 Integration with NumPy

Pandas is built on top of NumPy, so most numerical computations ultimately run on efficient NumPy arrays.

You can access the underlying data as a NumPy array via `.to_numpy()` (the recommended method) or the older `.values` attribute for performance-critical operations.

In [None]:
# Example: fast numerical computation using NumPy
A_np = df['A'].to_numpy()
B_np = df['B'].to_numpy()

# Compute correlation using NumPy
correlation = np.corrcoef(A_np, B_np)[0, 1]
correlation
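
As a further sketch, once the columns are plain arrays, any NumPy routine applies directly. The example below standardizes one column, labels another with `np.where`, and checks that `np.corrcoef` agrees with Pandas' own `Series.corr`. The demo data and variable names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "A": rng.integers(0, 100, size=1_000),
    "B": rng.standard_normal(1_000),
})

a = demo["A"].to_numpy()
b = demo["B"].to_numpy()

# Standardize B (z-score) with array arithmetic
b_z = (b - b.mean()) / b.std()

# Vectorized if/else: label each row without a Python loop
label = np.where(a >= 50, "high", "low")

# Pearson correlation computed two ways
r_numpy = np.corrcoef(a, b)[0, 1]
r_pandas = demo["A"].corr(demo["B"])
print(r_numpy, r_pandas)
```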

## 🔹 10.5 Integration with Polars for Speed

[Polars](https://pola.rs) is a high-performance DataFrame library written in Rust. It is **multi-threaded** and offers a **lazy API** with query optimization alongside its eager one, often outperforming Pandas by 5 to 10x for certain workloads.

In [None]:
import polars as pl

# Convert from Pandas to Polars
pl_df = pl.from_pandas(df)

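# Fast group-by operation (`group_by` replaces the older `groupby` spelling,
# which was deprecated and later removed in recent Polars releases)
pl_result = pl_df.group_by('C').agg([
    pl.col('A').mean().alias('avg_A'),
    pl.col('B').max().alias('max_B')
])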
pl_result

✅ **Takeaway:** Polars is a great alternative for performance-heavy workloads, especially when handling millions of rows.

## 🔹 10.6 Interacting with Databases via SQLAlchemy

You can read datasets from and write them to databases with `pandas.read_sql()` and `DataFrame.to_sql()`, passing a **SQLAlchemy** engine that manages the connection. Pushing aggregation into the SQL query keeps the transferred result small.

In [None]:
from sqlalchemy import create_engine

# In-memory SQLite database
engine = create_engine('sqlite://', echo=False)

# Write to SQL
df.head(1000).to_sql('sales_data', con=engine, index=False, if_exists='replace')

# Query from SQL
query_df = pd.read_sql('SELECT C, AVG(A) as avg_A, SUM(B) as sum_B FROM sales_data GROUP BY C', con=engine)
query_df
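
For tables too large to load at once, both `to_sql()` and `read_sql()` accept a `chunksize` argument. The sketch below uses a standard-library `sqlite3` connection, which Pandas also accepts, instead of a SQLAlchemy engine so that it has no extra dependency. The table contents and chunk size are illustrative assumptions.

```python
import sqlite3

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "A": rng.integers(0, 100, size=5_000),
    "B": rng.standard_normal(5_000),
})

conn = sqlite3.connect(":memory:")

# Write in batches of 1,000 rows
demo.to_sql("sales_data", con=conn, index=False, if_exists="replace", chunksize=1_000)

# With chunksize set, read_sql returns an iterator of DataFrames
total_rows = 0
running_sum = 0.0
for chunk in pd.read_sql("SELECT A, B FROM sales_data", con=conn, chunksize=1_000):
    total_rows += len(chunk)
    running_sum += chunk["B"].sum()

conn.close()
print(total_rows, running_sum)
```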

## ⚙️ Under the Hood

- Pandas delegates numeric operations to **NumPy C-level ufuncs**.
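- Dask builds a **lazy task graph** over Pandas partitions and runs it in parallel; Polars uses **multi-threading** and an optional **lazy query engine**.
- SQLAlchemy provides a common **engine and connection layer** over various database engines; Pandas uses that layer rather than the ORM.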
- Memory optimization relies on **bit width reduction** and **categorical encoding**.
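
The last point can be inspected directly. As a small sketch with made-up labels, a categorical column stores each distinct value once and keeps a compact integer code per row, and each integer width has a fixed range that decides whether a downcast is possible.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["X", "Y", "Z", "X", "Z"], dtype="category")

# Distinct values are stored once...
print(labels.cat.categories.tolist())  # ['X', 'Y', 'Z']
# ...and each row holds only a small integer code
print(labels.cat.codes.tolist())       # [0, 1, 2, 0, 2]
print(labels.cat.codes.dtype)          # int8

# Range and size of each signed integer width
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max, np.dtype(dtype).itemsize, "bytes")
```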

---

## 💼 Real-World Problem 1: Large Dataset Aggregation Pipeline

**Scenario:** You're analyzing 20 million sales records for a retail company. You need to calculate monthly statistics without crashing your machine.

**Goal:**
1. Use Dask for lazy computation.
2. Compute total sales and average discount per month.
3. Export the final results to a database for reporting.

In [None]:
import dask.dataframe as dd

# Simulate large CSV (use smaller data here for demo)
sales = pd.DataFrame({
    # Lowercase 'h' means hourly; the uppercase 'H' alias is deprecated since pandas 2.2
    'date': pd.date_range('2024-01-01', periods=10000, freq='h'),
    'sales': np.random.randint(100, 1000, 10000),
    'discount': np.random.uniform(0.05, 0.3, 10000)
})

dask_sales = dd.from_pandas(sales, npartitions=8)

# Compute monthly metrics
monthly_summary = (
    dask_sales.assign(month=dask_sales['date'].dt.to_period('M'))
    .groupby('month')
    .agg({'sales': 'sum', 'discount': 'mean'})
    .compute()
)
monthly_summary.head()
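
Goal 3 (exporting for reporting) is not covered by the cell above. Here is a minimal sketch, assuming a plain Pandas stand-in for the computed summary, month labels stored as `YYYY-MM` strings so they map to a simple text column, and an in-memory SQLite database. The names `sales_demo`, `monthly`, and `monthly_summary` are hypothetical.

```python
import sqlite3

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 400  # daily records, spanning a little over a year
sales_demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=n, freq="D"),
    "sales": rng.integers(100, 1000, size=n),
    "discount": rng.uniform(0.05, 0.3, size=n),
})

# Monthly totals and average discount, keyed by a sortable text label
monthly = (
    sales_demo.assign(month=sales_demo["date"].dt.strftime("%Y-%m"))
    .groupby("month", as_index=False)
    .agg(total_sales=("sales", "sum"), avg_discount=("discount", "mean"))
)

conn = sqlite3.connect(":memory:")
monthly.to_sql("monthly_summary", con=conn, index=False, if_exists="replace")

# Read the table back the way a reporting tool might
report = pd.read_sql(
    "SELECT month, total_sales, avg_discount FROM monthly_summary ORDER BY month",
    con=conn,
)
conn.close()
print(report.head())
```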

## 🌍 Real-World Problem 2: Hybrid Workflow with Polars and Pandas

**Scenario:** You receive a 2 GB CSV file. You need to preprocess it using Polars for speed, then switch to Pandas for visualization and modeling.

**Goal:** Demonstrate an efficient hybrid workflow combining both libraries.

In [None]:
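# Simulated stand-in for the loaded CSV; for a real file, use
# pl.read_csv(path) or the lazy pl.scan_csv(path)
pl_data = pl.DataFrame({
    'product': np.random.choice(['A', 'B', 'C'], 10000),
    'revenue': np.random.randint(100, 1000, 10000)
})

# Aggregate in Polars (`group_by` replaces the older `groupby` spelling)
agg = pl_data.group_by('product').agg(pl.col('revenue').mean().alias('avg_revenue'))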

# Convert to Pandas for visualization
pd_data = agg.to_pandas()
pd_data

## ✅ Best Practices / Pitfalls

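- ✅ Use vectorized operations instead of loops.
- ✅ Downcast numeric columns and convert low-cardinality string columns to `category` to reduce memory.
- ✅ Use Dask or Polars for large data workloads.
- ⚠️ Avoid creating many named intermediate DataFrames; chain methods or use `pipe()` so intermediates can be freed, and do not count on `inplace=True` to save memory, since most such operations still copy internally.
- ⚙️ Profile before optimizing; premature optimization often backfires.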
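
The `pipe()` advice can be sketched as follows: each step is a small function that takes and returns a DataFrame, and the chain avoids keeping named intermediates alive. The `orders` data and the helper names `add_revenue` and `revenue_by` are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "product": ["A", "B", "C", "A"],
    "price": [10.0, 20.0, 5.0, 10.0],
    "qty": [3, 1, 8, 2],
})


def add_revenue(frame, price_col, qty_col):
    """Return a new frame with a revenue column (input is left untouched)."""
    return frame.assign(revenue=frame[price_col] * frame[qty_col])


def revenue_by(frame, key):
    """Total revenue per value of `key`."""
    return frame.groupby(key, as_index=False)["revenue"].sum()


summary = (
    orders
    .pipe(add_revenue, "price", "qty")
    .pipe(revenue_by, "product")
    .sort_values("revenue", ascending=False, ignore_index=True)
)
print(summary)
```

Each helper can be tested on its own, and the original `orders` frame is not modified.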

---

## 💪 Challenge Exercise

**Task:** You are given a large dataset containing millions of e-commerce transactions.
1. Profile its performance bottlenecks.
2. Optimize data types to reduce memory usage by 50%.
3. Use Dask to compute monthly user purchase totals.
4. Integrate results into a SQLite database using `to_sql()`.

_Try implementing this full optimization pipeline on your own._

---
# --- End of Section 10 (Final Add-on Module) ---