# Practical Optimisations for Pandas 🐼
## Eyal Trabelsi

# About Me 🙈


- Software Engineer at Salesforce 👷

- Big passion for python, data and performance optimisations 🐍🤖

- Online at [medium](https://medium.com/@Eyaltra) | [twitter](https://twitter.com/eyaltra) 🌐

# Optimizing Your Pandas is not Rocket Science 🚀

# Optimization ?! Why ?🤨

- Fast is better than slow 🐇


- Latency: response time, e.g. a 200-millisecond client round trip
- Throughput: successful traffic flow, e.g. 200 requests per second


- Memory efficiency is good 💾


- Saving money is awesome [💸](https://aws.amazon.com/ec2/pricing/on-demand/)


- Hardware will only take you so far 💻

- OK, now that I have your attention, the next question I want to tackle is when we should optimize our code.

# Before We Optimize ⏰

- It's actually needed 🚔


#### remember optimized code is:
- harder to write and read
- less maintainable
- buggier, more brittle

#### Optimize when
- you have gathered requirements (there are some parts you won't be able to touch)
- you have established percentile SLAs: 50th, 95th, 99th and max

- Our code is well tested 💯

- [Focus on the bottlenecks](https://www.youtube.com/watch?v=9wfFXRCkkLE) 🍾

- I have a 45-minute talk on how to properly profile code; in this talk I give you a glimpse.

# Profiling 📍

- **timeit** - Benchmarks multiple runs of a code snippet and measures execution time ⌛

- **memit** - Measures process Memory 💾
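Outside a notebook, the same kind of measurement can be sketched with the standard-library `timeit` module, which times with a wall-clock timer by default. The two functions below (`squares_loop`, `squares_comprehension`) are hypothetical stand-ins, not part of the talk's dataset code.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


CALLS = 200  # calls per timing run

# Repeat each measurement 5 times and keep the fastest run to reduce noise.
loop_time = min(timeit.repeat(lambda: squares_loop(1_000), repeat=5, number=CALLS))
comp_time = min(timeit.repeat(lambda: squares_comprehension(1_000), repeat=5, number=CALLS))

print(f"loop:          {loop_time / CALLS * 1e6:.1f} µs per call")
print(f"comprehension: {comp_time / CALLS * 1e6:.1f} µs per call")
```

The `%%timeit` and `%%memit` cell magics used in the rest of this notebook wrap the same idea; `%%memit` comes from the `memory_profiler` extension.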

# Dataset 📉

In [None]:
! pip install numba numexpr

In [1]:
import math
import time
import warnings
from dateutil.parser import parse

import janitor
import numpy as np
import pandas as pd
from numba import jit
from sklearn import datasets
from pandas.api.types import is_datetime64_any_dtype as is_datetime

In [2]:
warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)
pd.options.display.max_columns = 999

In [3]:
path = 'https://raw.githubusercontent.com/FBosler/you-datascientist/master/invoices.csv'

def load_dataset(naivly=False):
    df = (pd.concat([pd.read_csv(path)
                       .clean_names()
                       .remove_columns(["meal_id", "company_id"]) 
                     for i in range(20)])
            .assign(meal_tip=lambda x: x.meal_price.map(lambda x: x * 0.2))
            .astype({"meal_price": int})
            .rename(columns={"meal_price": "meal_price_with_tip"}))

    if naivly:
        for col in df.columns:
            df[col] = df[col].astype(object)
    return df

In [37]:
df = load_dataset()
df.head()

| | order_id | date | date_of_meal | participants | meal_price_with_tip | type_of_meal | heroes_adjustment | meal_tip |
|---|---|---|---|---|---|---|---|---|
| 0 | 839FKFW2LLX4LMBB | 2016-05-27 | 2016-05-31 07:00:00+02:00 | ['David Bishop'] | 469 | Breakfast | False | 93.8 |
| 1 | 97OX39BGVMHODLJM | 2018-09-27 | 2018-10-01 20:00:00+02:00 | ['David Bishop'] | 22 | Dinner | False | 4.4 |
| 2 | 041ORQM5OIHTIU6L | 2014-08-24 | 2014-08-23 14:00:00+02:00 | ['Karen Stansell'] | 314 | Lunch | False | 62.8 |
| 3 | YT796QI18WNGZ7ZJ | 2014-04-12 | 2014-04-07 21:00:00+02:00 | ['Addie Patino'] | 438 | Dinner | False | 87.6 |
| 4 | 6YLROQT27B6HRF4E | 2015-07-28 | 2015-07-27 14:00:00+02:00 | ['Addie Patino' 'Susan Guerrero'] | 690 | Lunch | False | 138.0 |


# How 👀

# Use What You Need 🧑

- Keep needed columns only

- Keep needed rows only
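As a minimal sketch of both bullets, `pd.read_csv` can drop unneeded columns and rows at load time. The inline CSV below is a made-up stand-in for the invoices file, so the column values are illustrative only.

```python
import io

import pandas as pd

# Hypothetical miniature of the invoices data, kept inline so the sketch is self-contained.
csv_text = """order_id,date,type_of_meal,meal_price
A1,2016-05-27,Breakfast,469
B2,2018-09-27,Dinner,22
C3,2014-08-24,Lunch,314
D4,2014-04-12,Dinner,438
"""

# usecols: only the listed columns are parsed and kept in memory.
df_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "type_of_meal", "meal_price"],
)

# Filter rows as early as possible so later steps touch less data.
dinners = df_small[df_small["type_of_meal"] == "Dinner"]

# nrows: read only the first N rows, handy for a quick look at a large file.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
```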

# Don't Reinvent the Wheel 🎡

- Vast ecosystem

- Use existing solutions

- Fewer bugs

- Highly optimized

# Avoid Loops ♾

### Bad Option 😈

In [40]:
import warnings
warnings.filterwarnings("ignore")

In [38]:
def iterrows_original_meal_price(df):
    for i, row in df.iterrows():
        # Assign through a single .loc call; chained indexing (df.loc[i][...]) writes to a temporary copy
        df.loc[i, "original_meal_price"] = row["meal_price_with_tip"] - row["meal_tip"]
    return df

In [41]:
%%timeit -r 1 -n 1
iterrows_original_meal_price(df)

21min 50s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Better Option 🤵

In [5]:
def apply_original_meal_price(df):
    df["original_meal_price"] = df.apply(lambda x: x['meal_price_with_tip'] - x['meal_tip'], axis=1)
    return df

In [6]:
%%timeit 
apply_original_meal_price(df)

8.27 s ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 150x Improvement In Execution Time ⌛

### Best Option 👼

In [7]:
def vectorized_original_meal_price(df):
    df["original_meal_price"] = df["meal_price_with_tip"] - df["meal_tip"] 
    return df

In [8]:
%%timeit 
vectorized_original_meal_price(df)

4.05 ms ± 86.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Another 2000x Improvement In Execution Time ⌛

- **pandas vectorized functions**: +, -, .str.lower(), .str.strip(), .dt.second and more

- **numpy vectorized functions**: np.log, np.divide, np.subtract, np.where, and more

- **scipy vectorized functions**: scipy.special.gamma, scipy.special.beta and more

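- **np.vectorize**: a convenience wrapper for functions with no vectorized equivalent; it is essentially a Python loop, so expect much smaller gains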
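A small sketch of a few of the vectorized functions listed above, applied to a made-up three-row frame that reuses this notebook's column names. The `is_expensive` and `tip_ratio` columns and the 300 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

meals = pd.DataFrame({
    "type_of_meal": [" Breakfast", "DINNER ", "Lunch"],
    "meal_price_with_tip": [469, 22, 314],
    "meal_tip": [93.8, 4.4, 62.8],
})

# pandas string accessor: cleans the whole column without a Python-level loop.
meals["type_of_meal"] = meals["type_of_meal"].str.strip().str.lower()

# np.where replaces an if/else that would otherwise live inside .apply().
meals["is_expensive"] = np.where(meals["meal_price_with_tip"] > 300, "yes", "no")

# NumPy ufuncs accept Series directly and return a Series.
meals["tip_ratio"] = np.divide(meals["meal_tip"], meals["meal_price_with_tip"])
```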

# Picking the Right Type 🌈

## Motivation 🏆

In [9]:
ones = np.ones(shape=5000)
ones 

array([1., 1., 1., ..., 1., 1., 1.])

In [10]:
types = ['object', 'complex128', 'float64', 'int64', 'int32', 'int16', 'int8', 'bool']
df = pd.DataFrame(dict([(t, ones.astype(t)) for t in types]))
df.memory_usage(index=False, deep=True)

object        160000
complex128     80000
float64        40000
int64          40000
int32          20000
int16          10000
int8            5000
bool            5000
dtype: int64

## Supported Types 🌈

- int64 / float64


- bool

- objects


- datetime64 / timedelta


- [Category](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)

- [Sparse Types](https://pandas.pydata.org/docs/user_guide/sparse.html)


- [Nullable Integer](https://pandas.pydata.org/docs/user_guide/integer_na.html)/[Nullable Boolean](https://pandas.pydata.org/docs/user_guide/boolean.html)


- [Your Own Types](https://www.youtube.com/watch?v=xx7H5EkzQH0)

- Open Sourced Types like [cyberpandas](https://github.com/ContinuumIO/cyberpandas) and [geopandas](https://github.com/geopandas/geopandas) 
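To make two of these types concrete, here is a minimal sketch comparing an object column with its categorical version, followed by a nullable integer column. The data is synthetic and the variable names are illustrative.

```python
import pandas as pd

# A low-cardinality string column: only three distinct values, repeated many times.
meal_types = pd.Series(["Breakfast", "Dinner", "Lunch"] * 10_000)

as_object = meal_types.astype(object)
as_category = meal_types.astype("category")

# Category stores each distinct string once plus a small integer code per row.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# Nullable integer: missing values without falling back to float64 or object.
counts = pd.Series([1, 2, None], dtype="Int64")
print(counts.dtype, counts.sum())
```

Category pays off when a column has few distinct values relative to its length; for a nearly unique column such as an ID, the saving is smaller or absent.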

## Where We Stand 🌈

In [11]:
df = load_dataset(naivly=True)

In [12]:
df.memory_usage(deep=True).sum()

478844140

In [13]:
df.memory_usage(deep=True)

Index                   8002720
order_id               73024820
date                   67022780
date_of_meal           82027880
participants           84977580
meal_price_with_tip    36012240
type_of_meal           63688760
heroes_adjustment      32076480
meal_tip               32010880
dtype: int64

In [14]:
df.dtypes

order_id               object
date                   object
date_of_meal           object
participants           object
meal_price_with_tip    object
type_of_meal           object
heroes_adjustment      object
meal_tip               object
dtype: object

## Optimized Types 🌈

In [17]:
optimized_df = df.astype({'order_id': 'category',
                          'date': 'category',
                          'date_of_meal': 'category',
                          'participants': 'category',
                          'meal_price_with_tip': 'int16',
                          'type_of_meal': 'category',
                          'heroes_adjustment': 'bool',
                          'meal_tip': 'float32'})

In [18]:
optimized_df.memory_usage(deep=True).sum()

36349398

### 13x Improvement In Memory ⌛
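Casting to a narrower type is only safe when the values fit. The sketch below, on a few sample values from the table shown earlier, checks the range with `np.iinfo` before casting and shows `pd.to_numeric(..., downcast=...)` as one way to let pandas pick the smallest integer type. It is an illustration under those assumptions, not the procedure used to produce the numbers above.

```python
import numpy as np
import pandas as pd

prices = pd.Series([469, 22, 314, 438, 690], dtype="int64")
tips = pd.Series([93.8, 4.4, 62.8, 87.6, 138.0], dtype="float64")

# Check that every value fits into int16 before casting explicitly.
int16_info = np.iinfo(np.int16)
fits_int16 = prices.min() >= int16_info.min and prices.max() <= int16_info.max

# Let pandas choose the smallest signed integer type that holds the data.
small_prices = pd.to_numeric(prices, downcast="integer")

# float32 halves the memory at the cost of precision (about 7 significant digits).
small_tips = tips.astype("float32")

print(small_prices.dtype, small_tips.dtype)
```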

## Optimized Types 🌈


- Improved operation performance 🧮

In [21]:
%%timeit
df["meal_price_with_tip"].astype(object).mean()

46.4 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [22]:
%%timeit
df["meal_price_with_tip"].astype(float).mean()

19.2 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 2.5x Performance Improvement⌛

# Recommended Installation  👨‍🏫

- [numexpr](https://pypi.org/project/numexpr/) - Fast numerical expression evaluator for NumPy

- [bottleneck](https://github.com/pydata/bottleneck) - Uses specialized NaN-aware Cython routines to achieve large speedups.

- Better for medium to big datasets
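pandas picks these libraries up automatically for certain operations once they are installed. One place where the choice is visible is `DataFrame.eval`, whose default engine is numexpr when available and plain Python otherwise. The sketch below uses random data and reuses this notebook's column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "meal_price_with_tip": rng.integers(10, 1_000, size=100_000),
    "meal_tip": rng.random(100_000) * 100,
})

# Regular vectorized arithmetic.
plain = frame["meal_price_with_tip"] - frame["meal_tip"]

# Same expression through eval; with numexpr installed it is evaluated by numexpr,
# otherwise pandas falls back to its Python engine and the result is identical.
evaluated = frame.eval("meal_price_with_tip - meal_tip")
```

On a frame this small the difference is unlikely to matter, which is consistent with the note above that the benefit shows up on medium to big datasets.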

# Compiled Code 🤯

- Python dynamic nature 

- No compilation optimization

- Pure Python can be slow 

In [23]:
def foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

In [24]:
%%timeit
df.meal_price_with_tip.map(foo)

14.6 s ± 279 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Cython and Numba to the rescue 👨‍🚒

## Cython 🤯

- Up to 100x speedup from pure python 👍


- Learning Curve 👎

- Separate Compilation Step 👎 👍


In [25]:
%load_ext Cython

## Example 

In [26]:
%%cython
def cython_foo(long N):
    cdef long accumulator
    accumulator = 0

    cdef long i
    for i in range(N):
        accumulator += i

    return accumulator

In [27]:
%%timeit
df.meal_price_with_tip.map(cython_foo)

156 ms ± 4.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### 100x  Performance Improvement⌛

## Numba 🤯

- Up to 200x speedup from pure python  👍


- Easy 👍


Using Numba is really easy: it simply means adding a decorator to a function.

- Highly Configurable - fastmath, parallel, nogil 👍

- Mostly Numeric 👎 

## Example 

In [28]:
@jit(nopython=True)
def numba_foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

In [29]:
%%timeit
df.meal_price_with_tip.map(numba_foo)

218 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 65x  Performance Improvement⌛
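Calling a compiled function through `Series.map` still crosses the Python boundary once per element. A variant worth trying, sketched below under the assumption that the column is an integer NumPy array, moves the outer loop into the compiled function as well. `triangular_sums` is a hypothetical name, and the `try`/`except` fallback only exists so the sketch runs (slowly) without Numba.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs, uncompiled, when numba is not installed.
    def njit(func):
        return func


@njit
def triangular_sums(values):
    """For each n in values, compute 0 + 1 + ... + (n - 1), like foo above."""
    out = np.empty(values.shape[0], dtype=np.int64)
    for idx in range(values.shape[0]):
        accumulator = 0
        for i in range(values[idx]):
            accumulator += i
        out[idx] = accumulator
    return out


prices = pd.Series([3, 5, 10], name="meal_price_with_tip")

# Pass the underlying NumPy array once, then wrap the result back into a Series.
result = pd.Series(triangular_sums(prices.to_numpy()), index=prices.index)
print(result.tolist())  # → [3, 10, 45]
```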

### 1️⃣ Vectorized methods 

### 2️⃣ Numba 

### 3️⃣ Cython 

# General Python Optimizations 🐍

## Caching 🏎


- Avoid unnecessary work/computation.

- Faster code 

- functools.lru_cache
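A minimal sketch of caching with `functools.lru_cache`, using date parsing as the repeated work. The `parse_day` helper and the sample strings are illustrative; any pure function with hashable arguments can be cached the same way.

```python
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_day(text):
    """Parse a YYYY-MM-DD string; repeated inputs are served from the cache."""
    return datetime.strptime(text, "%Y-%m-%d")


# Date columns often repeat the same few values many times.
raw_dates = ["2016-05-27", "2018-09-27", "2016-05-27", "2016-05-27", "2018-09-27"]
parsed = [parse_day(text) for text in raw_dates]

info = parse_day.cache_info()
print(info.hits, info.misses)  # → 3 2
```

Only two strings are actually parsed; the other three calls are cache hits.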

## Intermediate Variables👩‍👩‍👧‍👧

- Intermediate calculations 

- Memory footprint of both objects

- Smarter variables allocation 

In [32]:
def another_foo(data):
    return data * 2

def foo(data):
    return data + 10


## Example

In [33]:
%reload_ext memory_profiler

In [34]:
def load_data():
    return np.ones((2 ** 30), dtype=np.uint8)

In [35]:
%%memit
def process():
    data = load_data()
    data2 = foo(data)           # data and data2 are both alive
    data3 = another_foo(data2)  # data, data2 and data3 are all alive at the peak
    return data3

process()

peak memory: 4794.75 MiB, increment: 2905.10 MiB


In [36]:
%%memit
def process():
    data = load_data()
    data = foo(data)          # rebinding the name releases the previous array
    data = another_foo(data)  # at most two arrays are alive at any time
    return data

process()

peak memory: 3790.43 MiB, increment: 1900.74 MiB
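The same effect can be reproduced at a smaller scale without `memory_profiler`. The sketch below uses the standard-library `tracemalloc` module, which NumPy reports its array allocations to, and 4 MiB arrays instead of 1 GiB. The function names are illustrative.

```python
import tracemalloc

import numpy as np

SIZE = 2 ** 22  # 4 MiB of uint8, far smaller than the 1 GiB array used above


def add_ten(data):
    return data + 10


def double(data):
    return data * 2


def separate_names():
    data = np.ones(SIZE, dtype=np.uint8)
    data2 = add_ten(data)
    data3 = double(data2)  # three arrays are referenced here
    return data3


def rebound_name():
    data = np.ones(SIZE, dtype=np.uint8)
    data = add_ten(data)  # the original array is released after rebinding
    data = double(data)   # at most two arrays are referenced at once
    return data


def peak_bytes(func):
    """Run func and return the peak traced allocation in bytes."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


peak_separate = peak_bytes(separate_names)
peak_rebound = peak_bytes(rebound_name)
print(f"separate names: {peak_separate / 2**20:.1f} MiB")
print(f"rebound name:   {peak_rebound / 2**20:.1f} MiB")
```

With separate names the peak should be roughly three arrays' worth of memory, and with a rebound name roughly two, matching the pattern in the `%%memit` outputs above.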


## Concurrency And Parallelism 🎸🎺🎻🎷

- pandas methods use a single process

- CPU-bound work can benefit from parallelism

- IO-bound work can benefit from either parallelism or concurrency
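For the IO-bound case, a minimal sketch with `concurrent.futures.ThreadPoolExecutor`: several CSV sources are loaded concurrently and then concatenated. The in-memory strings in `sources` are hypothetical stand-ins for real files or URLs, and `load_one` is an illustrative helper.

```python
import io
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical stand-ins for files or URLs; real code would read from disk or network.
sources = {
    "jan.csv": "order_id,meal_price\nA1,469\nB2,22\n",
    "feb.csv": "order_id,meal_price\nC3,314\n",
    "mar.csv": "order_id,meal_price\nD4,438\nE5,690\n",
}


def load_one(name):
    """Load one source and tag its rows with the source name."""
    return pd.read_csv(io.StringIO(sources[name])).assign(source=name)


# Threads overlap the waiting time of IO; pool.map keeps the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    frames = list(pool.map(load_one, sources))

combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # → 5
```

For CPU-bound work, the same pattern with `ProcessPoolExecutor` over chunks of a DataFrame is a common starting point, at the cost of pickling data between processes.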

# How 👀

- Use What You Need 💾⌛

- Don't Reinvent the Wheel ⌛💾

- Avoid Loops ⌛

- Picking the Right Types 💾⌛

- Recommended Installation ⌛💾

- Compiled Code ⌛

- General Python Optimizations ⌛💾

# Optimizing Your Pandas is not Rocket Science 🚀

![](https://i.pinimg.com/originals/b9/0a/79/b90a79b4c361d079144597d0bcdd61de.jpg)