# Level 11: Performance & Optimization

As your datasets grow, performance becomes increasingly important. Writing efficient Pandas code can save you time and memory. This level covers techniques to make your data manipulation faster and more memory-efficient.

In [1]:
import pandas as pd
import numpy as np

## 11.1 Memory Efficiency

### Using the `category` dtype
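If a column holds a small set of repeated string values (e.g., country, department), converting it to the `category` dtype can cut memory use substantially: each distinct string is stored once, and the rows hold small integer codes instead of full Python strings.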

In [2]:
size = 1_000_000
departments = ['HR', 'Engineering', 'Sales', 'Marketing']
df = pd.DataFrame({
    'department': np.random.choice(departments, size=size)
})

print(f"Memory usage with object dtype: {df['department'].memory_usage(deep=True) / 1e6:.2f} MB")

df['department'] = df['department'].astype('category')
print(f"Memory usage with category dtype: {df['department'].memory_usage(deep=True) / 1e6:.2f} MB")

Memory usage with object dtype: 63.75 MB
Memory usage with category dtype: 1.00 MB
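
A quick way to see what a `category` column actually stores is to inspect `.cat.categories` and `.cat.codes`. The sketch below uses a tiny Series with illustrative values (not the benchmark data above):

```python
import pandas as pd

# Toy data (illustrative values only)
s = pd.Series(['HR', 'Sales', 'HR', 'Engineering', 'Sales'], dtype='category')

# The distinct labels, stored once (sorted when inferred from the data)
print(s.cat.categories.tolist())  # ['Engineering', 'HR', 'Sales']

# One small integer per row, pointing at a label
print(s.cat.codes.tolist())       # [1, 2, 1, 0, 2]
print(s.cat.codes.dtype)          # int8
```

The saving shrinks as the number of distinct values approaches the number of rows, so it is worth comparing `memory_usage(deep=True)` before and after on your own data.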


### Downcasting Numeric Types
If your numeric data fits into a smaller integer or float type (e.g., `int8` instead of `int64`), you can downcast it to save memory.

In [3]:
df_num = pd.DataFrame({'a': np.random.randint(0, 100, size=size)})
print(f"Original memory usage: {df_num['a'].memory_usage(deep=True) / 1e6:.2f} MB")

# pd.to_numeric can downcast automatically
df_num['a_downcast'] = pd.to_numeric(df_num['a'], downcast='integer')
print(f"Downcasted memory usage: {df_num['a_downcast'].memory_usage(deep=True) / 1e6:.2f} MB")
print(f"New dtype: {df_num['a_downcast'].dtype}")

Original memory usage: 4.00 MB
Downcasted memory usage: 1.00 MB
New dtype: int8
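
The same idea can be applied to a whole DataFrame. The sketch below uses a hypothetical helper, `downcast_numeric` (not part of Pandas), that calls `pd.to_numeric` column by column with `downcast='integer'` or `downcast='float'`. The column names and values are made up for illustration. Float downcasting stops at `float32` and can lose precision, so check that this is acceptable for your data.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with numeric columns downcast."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Made-up columns with different value ranges
df_mixed = pd.DataFrame({
    'small_int': np.arange(100, dtype='int64'),            # fits in int8
    'big_int': np.arange(100, dtype='int64') * 1_000_000,  # needs int32
    'price': np.arange(100) / 4.0,                         # exact in float32
    'label': ['x'] * 100,                                  # non-numeric, left alone
})

df_compact = downcast_numeric(df_mixed)
print(df_compact.dtypes)
```

Each column gets the smallest type that can hold its values, and non-numeric columns are left untouched.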


## 11.2 Vectorization

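Vectorization means applying an operation to a whole array at once instead of iterating over it element by element. Most Pandas operations are vectorized: the loop runs in compiled code rather than in the Python interpreter, which makes them far faster than an explicit Python loop (roughly 1,000 times faster in the timings below).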

In [4]:
df_perf = pd.DataFrame(np.random.rand(10000, 3), columns=['A', 'B', 'C'])

# Inefficient: Using a loop
def custom_sum_loop(df):
    total = 0
    for i in range(len(df)):
        total += df['A'][i] + df['B'][i]
    return total

# Efficient: Vectorized operation
def custom_sum_vectorized(df):
    return (df['A'] + df['B']).sum()

%timeit custom_sum_loop(df_perf)
%timeit custom_sum_vectorized(df_perf)

110 ms ± 9.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
103 μs ± 5.89 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### `.where()`, `.mask()`, `.clip()`
These are vectorized alternatives to `if/else` logic.
- **`.where(cond, other)`**: Where `cond` is `True`, keep the original value. Otherwise, replace with `other`.
- **`.mask(cond, other)`**: Where `cond` is `True`, replace with `other`. Otherwise, keep the original value. (Opposite of `where`)
- **`.clip(lower, upper)`**: Trim values at specified lower and/or upper bounds.

In [5]:
s = pd.Series(range(5))
# Replace values less than 3 with -1
s.where(s >= 3, -1)

0   -1
1   -1
2   -1
3    3
4    4
dtype: int64

In [6]:
# Replace values greater than 2 with 10
s.mask(s > 2, 10)

0     0
1     1
2     2
3    10
4    10
dtype: int64

In [7]:
# Clip values to be between 1 and 3
s.clip(1, 3)

0    1
1    1
2    2
3    3
4    3
dtype: int64
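
Because `.mask()` is the inverse of `.where()`, one can be rewritten as the other by negating the condition. The `other` argument can also be a Series instead of a scalar. A brief sketch on the same kind of small Series:

```python
import pandas as pd

s = pd.Series(range(5))

# mask(cond, other) gives the same result as where(~cond, other)
masked = s.mask(s > 2, 10)
where_inverse = s.where(~(s > 2), 10)
print(masked.equals(where_inverse))  # True

# `other` can be a Series aligned on the index:
# keep even values, replace odd values with their negative
signed = s.where(s % 2 == 0, -s)
print(signed.tolist())  # [0, -1, 2, -3, 4]
```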

## 11.3 Method Chaining

Method chaining is a style of programming where you call methods on an object one after another. This can lead to clean, readable code by avoiding the creation of intermediate variables.

In [8]:
# Sample data for this example (illustrative values)
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Salary': [95000, 72000, np.nan, 120000],
}
df_chain = pd.DataFrame(data)

# Without chaining
df1 = df_chain.dropna()
df2 = df1.query('Salary > 80000')
df3 = df2.assign(Salary_k = df2['Salary'] / 1000)
result = df3[['Employee', 'Salary_k']]

# With method chaining
result_chained = (
    df_chain
    .dropna()
    .query('Salary > 80000')
    .assign(Salary_k = lambda df: df['Salary'] / 1000)
    [['Employee', 'Salary_k']]
)

result_chained

   Employee  Salary_k
0     Alice      95.0
3     Diana     120.0
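
The `lambda` passed to `.assign()` matters: inside a chain the intermediate result has no variable name, so a callable is the way to refer to the DataFrame as it exists at that step. A small sketch with made-up numbers (the column names mirror the example above; the values and the `Share` column are assumptions for illustration):

```python
import pandas as pd

# Made-up sample values for illustration
df_pay = pd.DataFrame({
    'Employee': ['Ann', 'Ben', 'Cy'],
    'Salary': [60000, 90000, 120000],
})

result = (
    df_pay
    .query('Salary > 80000')
    # The lambda receives only the rows that survived .query()
    .assign(Share=lambda d: d['Salary'] / d['Salary'].sum())
)
print(result['Share'].round(3).tolist())  # [0.429, 0.571]
```

Writing `df_pay['Salary'].sum()` inside `.assign()` would instead total all three rows, including the one that `.query()` filtered out.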

## 11.4 Using `eval()` and `query()` for Large Datasets

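For large DataFrames, `pd.eval()` and `df.query()` can be faster than standard Python expressions. When the optional `numexpr` library is installed, they evaluate the whole expression in one pass and avoid allocating large intermediate arrays. On small DataFrames, the overhead of parsing the expression string usually makes them slower than the plain expression, so time both on your own data.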

In [None]:
# You may need to install numexpr: !pip install numexpr
df_large = pd.DataFrame(np.random.rand(1_000_000, 4), columns=['A', 'B', 'C', 'D'])

%timeit df_large['A'] + df_large['B'] > 0.5
%timeit pd.eval("df_large['A'] + df_large['B'] > 0.5")
%timeit df_large.query("A + B > 0.5")

## 11.5 When to Use Alternatives

Pandas is powerful, but it is not the best tool for every job, especially when a dataset approaches or exceeds the memory available on one machine.

### Modin
- **What it is:** A library that wraps Pandas and parallelizes its operations across all available CPU cores.
- **When to use:** When your dataset is large (e.g., 1GB+) and your operations are taking too long. It has a nearly identical API to Pandas, so changes are minimal (`import modin.pandas as pd`).

### Polars
- **What it is:** A completely separate DataFrame library built from the ground up in Rust. It's known for its speed and memory efficiency.
- **When to use:** When performance is the priority, especially for large datasets and complex queries. It has its own distinct API that you would need to learn, but it is often significantly faster than Pandas.